Large-scale Recommender
Systems on Just a PC
LSRS 2013 keynote
(RecSys ’13 Hong Kong)

Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Big Data – small machine
My Background
• Academic: 5th year Ph.D. @ Carnegie Mellon.
Advisors: Guy Blelloch, Carlos Guestrin (UW)
2009 

2012 

+ Shotgun : Parallel L1-regularized regression solver (ICML 2011).
+ Internships at MSR Asia (2011) and Twitter (2012)

• Startup Entrepreneur
Habbo : founded 2000
Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and
GraphChi
3. Recommender systems with GraphChi
4. Future directions & Conclusion
Large-Scale Recommender Systems on
Just a PC

Why on a single machine?

Can’t we just use the
Cloud?
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a
reasonable time.

2. I need to solve the problem very fast.
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a
reasonable time.
Our work expands the space of feasible (graph) problems on
one machine:
- Our experiments use graphs as large as, or larger than, those in previous
papers on distributed graph computation. (+ we can do the Twitter
graph on a laptop)
- Most data is not that “big”.

2. I need to solve the problem very fast.
Our work raises the bar on required performance for a
“complicated” system.
Benefits of single machine systems
Assuming it can handle your big problems…
1. Programmer productivity
– Global state
– Can use “real data” for development

2. Inexpensive to install and administer; uses less
power.
3. Scalability.
Efficient Scaling
[Figure: task timelines over time T.
Distributed graph system: going from 6 machines to 12 machines gives (significantly) less than 2x throughput with 2x machines.
Single-computer systems (each capable of big tasks): going from 6 machines to 12 machines gives exactly 2x throughput with 2x machines.]
GRAPH COMPUTATION AND
GRAPHCHI
Why graphs for recommender systems?
• Graph = matrix: edge(u,v) = M[u,v]
– Note: always sparse graphs

• Intuitive, human-understandable
representation
– Easy to visualize and explain.

• Unifies collaborative filtering (typically matrix
based) with recommendation in social
networks.
– Random walk algorithms.

• Local view → vertex-centric computation
Vertex-Centric Computational Model
• Graph G = (V, E)
– directed edges: e = (source,
destination)
– each edge and vertex
associated with a value
(user-defined type)
– vertex and edge values can
be modified
• (structure modification also
supported)

[Figure: example graph with vertices A and B; every vertex and edge carries a data value.]
Vertex-centric Programming
• “Think like a vertex”
• Popularized by the Pregel and GraphLab
projects
MyFunc(vertex)
{ // modify neighborhood }
[Figure: a vertex and its neighborhood; every vertex and edge holds a data value.]
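A minimal sketch of the "think like a vertex" idea, in plain Python rather than GraphChi's actual API. The `Vertex` and `Edge` classes, `build_graph`, and the particular update rule in `my_func` are all hypothetical names chosen for illustration; the only thing taken from the slides is that an update function sees one vertex and may read and modify the values in its neighborhood.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: int
    destination: int
    value: float  # user-defined edge value


@dataclass
class Vertex:
    vertex_id: int
    value: float  # user-defined vertex value
    in_edges: list = field(default_factory=list)
    out_edges: list = field(default_factory=list)


def build_graph(num_vertices, edge_list):
    """Create vertices that share Edge objects with their neighbors."""
    vertices = [Vertex(i, 0.0) for i in range(num_vertices)]
    for src, dst, val in edge_list:
        edge = Edge(src, dst, val)
        vertices[src].out_edges.append(edge)
        vertices[dst].in_edges.append(edge)
    return vertices


def my_func(vertex):
    """Update function: touches only the vertex and its adjacent edges."""
    # Read the neighborhood: add up the values on the incoming edges.
    vertex.value = 1.0 + sum(e.value for e in vertex.in_edges)
    # Modify the neighborhood: publish the new value on the outgoing edges.
    for e in vertex.out_edges:
        e.value = vertex.value


# Chain 0 -> 1 -> 2 with every edge value starting at 0.0.
graph = build_graph(3, [(0, 1, 0.0), (1, 2, 0.0)])
for v in graph:
    my_func(v)
chain_values = [v.value for v in graph]
print(chain_values)
```

Because each update writes to the out-edges before the next vertex runs, vertex 1 already sees what vertex 0 wrote, so the values grow along the chain in a single pass.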
What is GraphChi

Both in OSDI’12!
The Main Challenge of Disk-based
Graph Computation:
Random Access

Storage random-access rates << the 5-10 M random edges
/ sec needed to achieve
“reasonable
performance”:
100s reads/writes per sec

~ 100K reads / sec (commodity)
~ 1M reads / sec (high-end arrays)
Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
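To make the gap concrete, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, one random access per edge and a graph of 1.5B edges (the twitter-2010 size quoted later in the talk); the value 100 stands in for the low end of "100s" of accesses per second. The function name is hypothetical.

```python
def hours_for_one_pass(num_edges, accesses_per_sec):
    """Hours needed to touch every edge once at a given random-access rate."""
    return num_edges / accesses_per_sec / 3600.0


NUM_EDGES = 1_500_000_000  # assumption: twitter-2010 size from a later slide

rates = {
    "100 accesses/sec (low end of '100s')": 100,
    "100K reads/sec": 100_000,
    "1M reads/sec": 1_000_000,
    "5M edges/sec (target)": 5_000_000,
}

hours = {label: hours_for_one_pass(NUM_EDGES, rate) for label, rate in rates.items()}
for label, h in hours.items():
    print(f"{label}: {h:.2f} hours per pass")
```

Under these assumptions a single pass takes months at hundreds of accesses per second and minutes at the target rate, which is why the access pattern, rather than the raw data size, is the main challenge.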

Parallel Sliding Windows


Only P large reads for each interval (sub-graph).

P² reads on one full pass.
GraphChi Program Execution
For T iterations:
  For p=1 to P
    For v in interval(p)
      updateFunction(v)

For T iterations:
  For v=1 to V
    updateFunction(v)
“Asynchronous”: updates immediately
visible (vs. bulk-synchronous).
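The nested loop above can be sketched in a few lines of Python. This is an in-memory illustration only, not GraphChi code: `make_intervals`, `run`, and the `min_label` update are hypothetical, and the intervals here are simple equal-sized ranges of vertex ids.

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex ids 0..V-1 into at most P contiguous intervals."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    return [range(start, min(start + size, num_vertices))
            for start in range(0, num_vertices, size)]


def run(values, in_neighbors, update, num_intervals, num_iterations):
    """For T iterations, for each interval p, for each v in interval(p): update(v)."""
    for _ in range(num_iterations):
        for interval in make_intervals(len(values), num_intervals):
            for v in interval:
                # Asynchronous: this write is visible to the updates that
                # follow within the same iteration.
                values[v] = update(v, values, in_neighbors)
    return values


def min_label(v, values, in_neighbors):
    """Example update: smallest label among v and the sources of its in-edges."""
    return min([values[v]] + [values[u] for u in in_neighbors[v]])


# Chain 0 -> 1 -> 2 -> 3; in_neighbors[v] lists the sources of v's in-edges.
in_neighbors = {0: [], 1: [0], 2: [1], 3: [2]}
labels = run([0, 1, 2, 3], in_neighbors, min_label,
             num_intervals=2, num_iterations=1)
print(labels)
```

With immediately visible updates the smallest label travels down the whole chain in one iteration; a bulk-synchronous run of the same update would move it only one hop per iteration.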
Performance

GraphChi can compute on the
full Twitter follow-graph with
just a standard laptop.
~ as fast as a very large Hadoop cluster!
(size of the graph Fall 2013, > 20B edges [Gupta et al 2013])
GraphChi is Open Source
• C++ and Java versions on GitHub:
http://github.com/graphchi
– The Java version has a Hadoop/Pig wrapper.
• If you really, really want to use Hadoop.
RECSYS MODEL TRAINING
WITH GRAPHCHI
Overview of Recommender Systems for
GraphChi
• Collaborative Filtering toolkit (next slide)
• Link prediction in large networks
– Random-walk based approaches (Twitter)
– Talk on Wednesday.
GraphChi’s Collaborative Filtering Toolkit
• Developed by Danny Bickson
(CMU / GraphLab Inc)
• Includes:
– Alternating Least Squares (ALS)
– Sparse-ALS
– SVD++
– LibFM (factorization machines)
– GenSGD
– Item-similarity based methods
– PMF
– CliMF (contributed by Mark Levy)
– ….
See Danny’s blog for more information:
http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
Note: the toolkit is in the C++ version.
A Java version is in development by a CMU team.
TWO EXAMPLES: ALS AND
ITEM-BASED CF
Example: Alternating Least Squares
Matrix Factorization (ALS)
• Task: Predict ratings for items (movies) by
users.
• Model:
– Latent factor model (see next slide)

Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale
Parallel Collaborative Filtering for the Netflix Prize” (2008)
ALS: Product – Item bipartite graph
[Figure: bipartite graph connecting users to the movies Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, and La Dolce Vita. Each vertex holds a vector of latent factors; each edge holds a rating.]
User’s rating of a movie modeled as a dot-product:
<factor(user), factor(movie)>
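Written out, the dot-product model reads as follows. The symbols are assumptions introduced here for illustration: x_u is the factor of user u, y_m the factor of movie m, D the number of latent dimensions, and r_um an observed rating. The squared-error measure on the right is the usual way to judge such a model and is not spelled out on the slide.

```latex
\hat{r}_{um} = \langle x_u, y_m \rangle = \sum_{d=1}^{D} x_{ud}\, y_{md},
\qquad
\text{error} = \sum_{(u,m)\,\in\,\text{ratings}} \left( r_{um} - \hat{r}_{um} \right)^2
```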
ALS: GraphChi implementation
• Update function handles one vertex at a time
(user or movie)
• For each user:
– Estimate latent(user): minimize least squares of
dot-product predicted ratings

• GraphChi executes the update function for
each vertex (in parallel), and loads edges
(ratings) from disk
– Latent factors in memory: need O(V) memory.
– If the factors don’t fit in memory, they can be replicated to the
edges and thus stored on disk.
Scales to very large problems!
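The per-user step can be sketched as an ordinary least-squares solve. This is a pure-Python illustration, not the toolkit's implementation: `solve` and `update_user` are hypothetical helpers, and the `reg` parameter is an added assumption (it defaults to 0, which gives the plain least-squares fit described above).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def update_user(ratings, movie_factors, reg=0.0):
    """Least-squares estimate of one user's latent factor.

    ratings: list of (movie_id, rating) pairs, i.e. the user's edges.
    movie_factors: dict mapping movie_id to a list of D floats.
    """
    d = len(next(iter(movie_factors.values())))
    # Normal equations: (sum of y y^T + reg * I) x = sum of rating * y
    a = [[reg if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for movie_id, rating in ratings:
        y = movie_factors[movie_id]
        for i in range(d):
            b[i] += rating * y[i]
            for j in range(d):
                a[i][j] += y[i] * y[j]
    return solve(a, b)


movie_factors = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0]}
user = update_user([("m1", 4.0), ("m2", 2.0), ("m3", 6.0)], movie_factors)
print(user)
```

The same routine serves for movie vertices by swapping the roles of users and movies; alternating between the two sides is what gives ALS its name.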
ALS: Performance
Matrix Factorization (Alternating Least Squares)
Netflix (99M edges), D=20
[Bar chart: running time in minutes (axis 0 to 12) for GraphChi (Mac Mini) and GraphLab v1 (8 cores).]

Remark: Netflix is not a big problem, but
GraphChi will scale at most linearly with
input size (ALS is CPU bounded, so should
be sub-linear in #ratings).
Example: Item-Based CF
• Task: compute a similarity score [e.g.
Jaccard] for each movie-pair that has at least
one viewer in common.
– Similarity(X, Y) ~ # common viewers
– Output top K similar items for each item to a file.
– … or: create edge between X, Y containing the
similarity.

• Problem: enumerating all pairs takes too
much time.
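As a small illustration of the task and of why the naive approach is costly, the sketch below scores movie pairs with the Jaccard index by looping over all pairs. The data and the function names are made up for the example; the quadratic loop over `combinations` is exactly the step that becomes too slow on large catalogs.

```python
from itertools import combinations


def jaccard(viewers_x, viewers_y):
    """Jaccard similarity: common viewers divided by all viewers of either movie."""
    union = len(viewers_x | viewers_y)
    return len(viewers_x & viewers_y) / union if union else 0.0


def similar_pairs(viewers_by_movie):
    """Score every movie pair with at least one common viewer (naive all-pairs)."""
    scores = {}
    for x, y in combinations(sorted(viewers_by_movie), 2):
        if viewers_by_movie[x] & viewers_by_movie[y]:
            scores[(x, y)] = jaccard(viewers_by_movie[x], viewers_by_movie[y])
    return scores


# Hypothetical viewing data.
viewers_by_movie = {
    "City of God": {"u1", "u2", "u3"},
    "La Dolce Vita": {"u2", "u3", "u4"},
    "Wild Strawberries": {"u5"},
}
scores = similar_pairs(viewers_by_movie)
print(scores)
```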
Solution: Enumerate all triangles of the graph.

New problem: how to enumerate triangles if the
graph does not fit in RAM?
[Figure: user-movie graph with the movies Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, and La Dolce Vita.]
Enumerating Triangles (Item-CF)
• Triangles with edge (u, v) =
intersection(neighbors(u), neighbors(v))
• Iterative memory efficient solution (next
slide)
Algorithm:
• Let pivots be a subset of the vertices;
• Load all neighbor-lists (adjacency lists)
of pivots into RAM
• Now use GraphChi to load all vertices
from disk, one by one, and compare
their adjacency lists to the pivots’
adjacency lists (similar to merge).
• Repeat with a new subset of pivots.
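The pivot scheme can be sketched in memory as follows. This is an illustration under stated assumptions, not GraphChi's implementation: edges are oriented from the lower to the higher vertex id so each triangle is counted once, the "disk" is just a dictionary, and set membership replaces the merge-style comparison of sorted lists mentioned above. `oriented` and `count_triangles` are hypothetical names.

```python
def oriented(edges):
    """Adjacency lists that keep only neighbors with a higher vertex id."""
    adjacency = {}
    for a, b in edges:
        lo, hi = min(a, b), max(a, b)
        adjacency.setdefault(lo, []).append(hi)
        adjacency.setdefault(hi, [])
    return adjacency


def count_triangles(adjacency, pivots_per_pass):
    """Count triangles while holding only one batch of pivot lists in RAM."""
    vertices = sorted(adjacency)
    total = 0
    for start in range(0, len(vertices), pivots_per_pass):
        # "Load" the adjacency lists of this subset of pivots into RAM.
        batch = vertices[start:start + pivots_per_pass]
        pivots = {p: set(adjacency[p]) for p in batch}
        # Stream over all vertices (a real system would read them from disk).
        for v in vertices:
            neighbors = adjacency[v]
            for u in neighbors:
                if u in pivots:
                    # Common neighbors of v and pivot u close a triangle.
                    total += sum(1 for w in neighbors if w in pivots[u])
    return total


edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
triangles = count_triangles(oriented(edges), pivots_per_pass=2)
print(triangles)
```

A smaller `pivots_per_pass` lowers the memory needed per pass at the price of more passes over the graph; the count itself does not depend on the batch size.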
Triangle Counting Performance
twitter-2010 (1.5B edges)
[Bar chart: running time in minutes (axis 0 to 450) for GraphChi (Mac Mini) and Hadoop (1636 machines).]
FUTURE DIRECTIONS & FINAL
REMARKS
Single-Machine Computing in
Production?
• GraphChi supports incremental
computation with dynamic graphs:
– Can keep on running indefinitely, adding new
edges to the graph → Constantly fresh model.
– However, requires engineering – not included
in the toolkit.

• Compare to a cluster-based system (such
as Hadoop) that needs to compute from
scratch.
Unified Recsys Platform for GraphChi?
• Working with master’s students at CMU.
• Goal: ability to easily compare different
algorithms, parameters
– Unified input, output.
– General programmable API (not just file-based)
– Evaluation process: Several evaluation metrics;
Cross-validation, held-out data…
– Run many algorithm instances in parallel, on
same graph.
– Java.

• Scalable from the get-go.
[Diagram: proposed platform, components as labeled on the slide.
DataDescriptor (data definition): column1: categorical, column2: real, column3: key, column4: categorical.
Input data.
Algorithm X: Input, with an Algorithm Input Descriptor map(input: DataDescriptor).
GraphChi Preprocessor, producing GraphChi Input and aux data on Disk.
Algorithm X, Y, and Z Training Programs, reporting training metrics.
Algorithm X Predictor, using held-out data (test data) and reporting test quality metrics.]
Recent developments: Disk-based Graph
Computation
• Two disk-based graph computation
systems were published recently:
– TurboGraph (KDD’13)
– X-Stream (SOSP’13 in October)

• Significantly better performance than
GraphChi on many problems
– Avoid preprocessing (“sharding”)
– But GraphChi can do some computation that X-Stream cannot (triangle counting and related);
TurboGraph requires SSD
– Hot research area!
Do you need GraphChi – or any system?
• Heck, for many algorithms, you can just
mmap() over your (binary) adjacency list /
sparse matrix, and write a for-loop.
– See Lin, Chau, Kang: “Leveraging Memory Mapping for Fast and
Scalable Graph Computation on a PC” (Big Data ’13)

• Obviously good to have a common API
– And some algos need more advanced
solutions (like GraphChi, X-Stream, TurboGraph)
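The "mmap() plus a for-loop" approach might look like the sketch below. The binary layout (pairs of little-endian unsigned 32-bit integers, source then destination) and the helper names are assumptions made for this example, not a format defined by any of the systems mentioned; the loop computes out-degrees as a stand-in for a real algorithm.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical on-disk layout: one (source, destination) pair of uint32 per edge.
EDGE = struct.Struct("<II")


def write_edges(path, edges):
    """Write the edge list in the binary layout described above."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def out_degrees(path, num_vertices):
    """Memory-map the edge file and count out-edges with a plain for-loop."""
    degrees = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), EDGE.size):
                src, _dst = EDGE.unpack_from(mm, offset)
                degrees[src] += 1
    return degrees


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    write_edges(path, [(0, 1), (0, 2), (1, 2)])
    degrees = out_degrees(path, 3)
print(degrees)
```

The operating system pages the file in and out as the loop advances, so the graph can exceed RAM as long as the access pattern stays mostly sequential.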
Beware of the hype!
Conclusion
• Very large recommender algorithms can now
be run on just your PC or laptop.
– Additional performance from multi-core
parallelism.
– Great for productivity – scale by replicating.

• In general, good single machine scalability
requires care with data structures, memory
management → natural with C/C++, with
Java (etc.) need low-level byte massaging.
– Frameworks like GraphChi hide the low-level.

• More work needed to “productize” current
work.
Thank you!

Aapo Kyrölä
Ph.D. candidate @ CMU – soon to
graduate! (Currently visiting U.W)
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Más contenido relacionado

La actualidad más candente

Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slidesMLconf
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...asimkadav
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindingsDmitriy Lyubimov
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...Menlo Systems GmbH
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsRevolution Analytics
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...Nesreen K. Ahmed
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016MLconf
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
A Fast and Dirty Intro to NetworkX (and D3)
A Fast and Dirty Intro to NetworkX (and D3)A Fast and Dirty Intro to NetworkX (and D3)
A Fast and Dirty Intro to NetworkX (and D3)Lynn Cherny
 

La actualidad más candente (20)

Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slides
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
 
Start From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize AlgorithmStart From A MapReduce Graph Pattern-recognize Algorithm
Start From A MapReduce Graph Pattern-recognize Algorithm
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
A Fast and Dirty Intro to NetworkX (and D3)
A Fast and Dirty Intro to NetworkX (and D3)A Fast and Dirty Intro to NetworkX (and D3)
A Fast and Dirty Intro to NetworkX (and D3)
 

Destacado

Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Till Rohrmann
 
Movie topics- Efficient features for movie recommendation systems
Movie topics- Efficient features for movie recommendation systemsMovie topics- Efficient features for movie recommendation systems
Movie topics- Efficient features for movie recommendation systemssuvir bhargav
 
CSTalks - Real movie recommendation - 9 Mar
CSTalks - Real movie recommendation - 9 MarCSTalks - Real movie recommendation - 9 Mar
CSTalks - Real movie recommendation - 9 Marcstalks
 
A Non-Intrusive Movie Recommendation System
A Non-Intrusive Movie Recommendation SystemA Non-Intrusive Movie Recommendation System
A Non-Intrusive Movie Recommendation SystemLaura Po
 
Developing and Movie Recommendation System in R
Developing and Movie Recommendation System in RDeveloping and Movie Recommendation System in R
Developing and Movie Recommendation System in RJody Schechter
 
Moviesion: Content-based Movie Recommender Fueled by Linked Open Data
Moviesion: Content-based Movie Recommender Fueled by Linked Open DataMoviesion: Content-based Movie Recommender Fueled by Linked Open Data
Moviesion: Content-based Movie Recommender Fueled by Linked Open DataHossein Fani
 
Recommender system
Recommender systemRecommender system
Recommender systemSaiguru P.v
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Developing a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkDeveloping a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkEdureka!
 
Congelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadCongelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadDiario Elcomahueonline
 
Leccion i persona_y_organizacion
Leccion i persona_y_organizacionLeccion i persona_y_organizacion
Leccion i persona_y_organizacionrichard rivera
 
Lab safety 12_10_13
Lab safety 12_10_13Lab safety 12_10_13
Lab safety 12_10_13skwahl
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental SummitNapervilleNCEC
 

Destacado (20)

Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
 
Movie topics- Efficient features for movie recommendation systems
Movie topics- Efficient features for movie recommendation systemsMovie topics- Efficient features for movie recommendation systems
Movie topics- Efficient features for movie recommendation systems
 
CSTalks - Real movie recommendation - 9 Mar
CSTalks - Real movie recommendation - 9 MarCSTalks - Real movie recommendation - 9 Mar
CSTalks - Real movie recommendation - 9 Mar
 
A Non-Intrusive Movie Recommendation System
A Non-Intrusive Movie Recommendation SystemA Non-Intrusive Movie Recommendation System
A Non-Intrusive Movie Recommendation System
 
Developing and Movie Recommendation System in R
Developing and Movie Recommendation System in RDeveloping and Movie Recommendation System in R
Developing and Movie Recommendation System in R
 
Moviesion: Content-based Movie Recommender Fueled by Linked Open Data
Moviesion: Content-based Movie Recommender Fueled by Linked Open DataMoviesion: Content-based Movie Recommender Fueled by Linked Open Data
Moviesion: Content-based Movie Recommender Fueled by Linked Open Data
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Recommendation engine
Recommendation engineRecommendation engine
Recommendation engine
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Developing a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkDeveloping a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with Spark
 
Congelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertadCongelamiento de Precios - Productos en supermercados libertad
Congelamiento de Precios - Productos en supermercados libertad
 
Leccion i persona_y_organizacion
Leccion i persona_y_organizacionLeccion i persona_y_organizacion
Leccion i persona_y_organizacion
 
Publicidad
PublicidadPublicidad
Publicidad
 
Seminario..
Seminario..Seminario..
Seminario..
 
Thalia
ThaliaThalia
Thalia
 
GANGA
GANGAGANGA
GANGA
 
Lab safety 12_10_13
Lab safety 12_10_13Lab safety 12_10_13
Lab safety 12_10_13
 
2012 DuPage Environmental Summit
2012 DuPage Environmental Summit2012 DuPage Environmental Summit
2012 DuPage Environmental Summit
 
Cascalog
CascalogCascalog
Cascalog
 

Similar a Large-scale Recommendation Systems on Just a PC

Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskASI Data Science
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...Institute of Contemporary Sciences
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 

Similar a Large-scale Recommendation Systems on Just a PC (20)

Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Large-scale Recommendation Systems on Just a PC

  • 1. Large-scale Recommender Systems on Just a PC LSRS 2013 keynote (RecSys ’13 Hong Kong) Aapo Kyrölä Ph.D. candidate @ CMU http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov Big Data – small machine
  • 2. My Background • Academic: 5th year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW) 2009  2012  + Shotgun : Parallel L1-regularized regression solver (ICML 2011). + Internships at MSR Asia (2011) and Twitter (2012) • Startup Entrepreneur Habbo : founded 2000
  • 3. Outline of this talk 1. Why single-computer computing? 2. Introduction to graph computation and GraphChi 3. Recommender systems with GraphChi 4. Future directions & Conclusion
  • 4. Large-Scale Recommender Systems on Just a PC Why on a single machine? Can’t we just use the Cloud?
  • 5. Why use a cluster? Two reasons: 1. One computer cannot handle my problem in a reasonable time. 2. I need to solve the problem very fast.
  • 6. Why use a cluster? Two reasons: 1. One computer cannot handle my problem in a reasonable time. Our work expands the space of feasible (graph) problems on one machine: - Our experiments use the same graphs as previous papers on distributed graph computation, or bigger ones. (+ we can do the Twitter graph on a laptop) - Most data is not that “big”. 2. I need to solve the problem very fast. Our work raises the bar on required performance for a “complicated” system.
  • 7. Benefits of single machine systems Assuming it can handle your big problems… 1. Programmer productivity – Global state – Can use “real data” for development 2. Inexpensive to install, administer, less power. 3. Scalability.
  • 8. Efficient Scaling. [Figure: tasks completed in time T. Distributed graph system: Tasks 1–7 on 6 machines, T1–T11 on 12 machines; (significantly) less than 2x throughput with 2x machines. Single-computer system (capable of big tasks): Tasks 1–6 on 6 machines, Tasks 1–12 on 12 machines; exactly 2x throughput with 2x machines.]
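The scaling argument on this slide can be worked through with the made-up numbers from the speaker notes: a 6-machine cluster finishing 7 tasks in time T, an assumed 50% gain when the cluster is doubled, and one task per machine for the single-computer system. The function name and parameters below are hypothetical and only illustrate the arithmetic.

```python
def tasks_in_time_t(base_tasks, base_machines, machines, gain_per_doubling):
    """Tasks finished in time T after doubling the machine count zero or more times."""
    doublings = 0
    m = base_machines
    while m < machines:
        m *= 2
        doublings += 1
    return base_tasks * gain_per_doubling ** doublings


# Distributed system: 7 tasks on 6 machines, 1.5x throughput per doubling (fake numbers).
cluster_6 = tasks_in_time_t(7, 6, 6, 1.5)
cluster_12 = tasks_in_time_t(7, 6, 12, 1.5)

# Independent single machines: one task each, so throughput doubles exactly.
single_6 = tasks_in_time_t(6, 6, 6, 2.0)
single_12 = tasks_in_time_t(6, 6, 12, 2.0)

print(cluster_6, cluster_12, single_6, single_12)
```

Under these assumptions the cluster leads at 6 machines (7 tasks against 6) but falls behind at 12 (10.5 against 12).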
  • 11. Why graphs for recommender systems? • Graph = matrix: edge(u,v) = M[u,v] – Note: always sparse graphs • Intuitive, human-understandable representation – Easy to visualize and explain. • Unifies collaborative filtering (typically matrix based) with recommendation in social networks. – Random walk algorithms. • Local view → vertex-centric computation
  • 12. Vertex-Centric Computational Model • Graph G = (V, E) – directed edges: e = (source, destination) – each edge and vertex associated with a value (user-defined type) – vertex and edge values can be modified • (structure modification also supported) [Figure: example graph with vertices A and B; each vertex and edge carries a Data value.]
  • 13. Vertex-centric Programming • “Think like a vertex” • Popularized by the Pregel and GraphLab projects • MyFunc(vertex) { // modify neighborhood } [Figure: a vertex and its neighborhood; each vertex and edge carries a Data value.]
  • 14. What is GraphChi Both in OSDI’12!
  • 15. The Main Challenge of Disk-based Graph Computation: Random Access • Need 5-10 M random edges / sec to achieve “reasonable performance” • Disks deliver far less (<<): 100s of reads/writes per sec; ~ 100K reads / sec (commodity); ~ 1M reads / sec (high-end arrays)
  • 16. Parallel Sliding Windows: only P large reads for each interval (sub-graph); P² reads on one full pass. Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
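The read count on this slide can be illustrated with a small in-memory sketch. It assumes the shard layout described in the OSDI 2012 paper as I understand it (shard p holds the edges whose destination falls in vertex interval p, sorted by source); the slide itself does not spell this out, and all function names are hypothetical rather than GraphChi's API.

```python
from bisect import bisect_left


def build_shards(edges, intervals):
    """Shard p: edges with destination in interval p, sorted by source (assumed layout)."""
    return [sorted((s, d) for s, d in edges if lo <= d < hi) for lo, hi in intervals]


def window(shard, lo, hi):
    """Edges of a shard whose source is in [lo, hi).

    Because the shard is sorted by source, this is one contiguous slice,
    which on disk would be a single sequential read.
    """
    return shard[bisect_left(shard, (lo,)):bisect_left(shard, (hi,))]


def full_pass(shards, intervals):
    """Count large reads for one pass and collect each interval's sub-graph."""
    reads = 0
    subgraphs = []
    for p, (lo, hi) in enumerate(intervals):
        in_edges = shards[p]          # the interval's own shard, read completely
        reads += 1
        out_edges = []
        for q, shard in enumerate(shards):
            out_edges += window(shard, lo, hi)
            if q != p:                # one sliding-window read per other shard
                reads += 1
        subgraphs.append((in_edges, out_edges))
    return reads, subgraphs


edges = [(0, 1), (0, 3), (1, 2), (2, 0), (3, 1), (3, 2)]
intervals = [(0, 2), (2, 4)]          # P = 2 vertex intervals
reads, subgraphs = full_pass(build_shards(edges, intervals), intervals)
print(reads)
```

Each interval costs one full shard read plus P - 1 window reads, so P reads per interval and P² per pass, matching the slide.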
  • 17. GraphChi Program Execution • For T iterations: for p = 1 to P: for v in interval(p): updateFunction(v) • Or, viewed as a single loop over all vertices: For T iterations: for v = 1 to V: updateFunction(v) • “Asynchronous”: updates immediately visible (vs. bulk-synchronous).
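A minimal sketch of this execution loop, assuming vertex values held in a plain list and a toy update function. The names `run_intervals` and `average_update` are hypothetical and not part of GraphChi; the point is only to show what "asynchronous" means here.

```python
def run_intervals(values, neighbors, intervals, update, iterations):
    """Run `update` on every vertex, one vertex interval at a time."""
    for _ in range(iterations):
        for lo, hi in intervals:          # p = 1..P
            # A disk-based system would load this interval's edges here.
            for v in range(lo, hi):       # v in interval(p)
                # Asynchronous: the value is written in place, so vertices
                # updated later in the same iteration already see it.
                values[v] = update(v, values, neighbors[v])
    return values


def average_update(v, values, nbrs):
    """Toy update: move a vertex halfway toward the mean of its neighbors."""
    if not nbrs:
        return values[v]
    mean = sum(values[u] for u in nbrs) / len(nbrs)
    return 0.5 * values[v] + 0.5 * mean


# Path graph 0 - 1 - 2 - 3, split into P = 2 intervals.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = run_intervals([0.0, 0.0, 0.0, 8.0], neighbors,
                       [(0, 2), (2, 4)], average_update, iterations=1)
print(values)
```

After one iteration vertex 3 ends at 5.0, because it sees the freshly updated value 2.0 of vertex 2. A bulk-synchronous run would have used the old value 0.0 and produced 4.0.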
  • 18. Performance GraphChi can compute on the full Twitter follow-graph with just a standard laptop. ~ as fast as a very large Hadoop cluster! (size of the graph Fall 2013, > 20B edges [Gupta et al 2013])
  • 19. GraphChi is Open Source • C++ and Java versions on GitHub: http://github.com/graphchi – Java version has a Hadoop/Pig wrapper. • If you really really want to use Hadoop.
  • 21. Overview of Recommender Systems for GraphChi • Collaborative Filtering toolkit (next slide) • Link prediction in large networks – Random-walk based approaches (Twitter) – Talk on Wednesday.
  • 22. GraphChi’s Collaborative Filtering Toolkit • Developed by Danny Bickson (CMU / GraphLab Inc) • Includes: – Alternating Least Squares (ALS) – Sparse-ALS – SVD++ – LibFM (factorization machines) – GenSGD – Item-similarity based methods – PMF – CliMF (contributed by Mark Levy) – …. See Danny’s blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html Note: In the C++ version. Java version in development by a CMU team.
  • 23. TWO EXAMPLES: ALS AND ITEM-BASED CF
  • 24. Example: Alternating Least Squares Matrix Factorization (ALS) • Task: Predict ratings for items (movies) by users. • Model: – Latent factor model (see next slide) Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale Parallel Collaborative Filtering for the Netflix Prize” (2008)
  • 25. ALS: Product – Item bipartite graph. User’s rating of a movie modeled as a dot-product: <factor(user), factor(movie)> [Figure: bipartite graph of users and movies (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita), with a latent factor vector on each vertex and a rating on each edge.]
  • 26. ALS: GraphChi implementation • Update function handles one vertex at a time (user or movie) • For each user: – Estimate latent(user): minimize the squared error of the dot-product predicted ratings • GraphChi executes the update function for each vertex (in parallel), and loads edges (ratings) from disk – Latent factors in memory: need O(V) memory. – If factors don’t fit in memory, can replicate them to the edges and thus store them on disk. Scales to very large problems!
  • 27. ALS: Performance. Matrix Factorization (Alternating Least Squares), Netflix (99M edges), D=20: GraphChi (Mac Mini) vs. GraphLab v1 (8 cores). [Bar chart; x-axis: minutes, 0–12.] Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so should be sub-linear in #ratings).
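The per-user update can be sketched as a small regularized least-squares solve with the movie factors held fixed. This is an illustrative sketch, not the toolkit's code: it uses a plain `reg * I` ridge term (an assumption; the cited paper weights the regularization differently), pure-Python Gaussian elimination instead of a linear algebra library, and hypothetical function names.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def als_update_user(ratings, movie_factors, reg=0.1):
    """Re-estimate one user's latent factor p from that user's edges.

    ratings: list of (movie_id, rating) pairs.
    Solves (sum_m q_m q_m^T + reg * I) p = sum_m r_m q_m.
    """
    d = len(next(iter(movie_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for movie, r in ratings:
        q = movie_factors[movie]
        for i in range(d):
            b[i] += r * q[i]
            for j in range(d):
                a[i][j] += q[i] * q[j]
    return solve(a, b)


def predict(p, q):
    """Predicted rating: dot product of user and movie factors."""
    return sum(x * y for x, y in zip(p, q))


# Toy data: two movies with orthogonal factors, D = 2.
movie_factors = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
user = als_update_user([("A", 4.0), ("B", 2.0)], movie_factors, reg=0.0)
print(user, predict(user, movie_factors["A"]))
```

The movie update is symmetric: hold the user factors fixed and solve the same kind of system over the movie's incoming ratings.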
  • 28. Example: Item-Based CF • Task: compute a similarity score [e.g. Jaccard] for each movie-pair that has at least one viewer in common. – Similarity(X, Y) ~ # common viewers – Output top K similar items for each item to a file. – … or: create edge between X, Y containing the similarity. • Problem: enumerating all pairs takes too much time.
  • 29. Solution: Enumerate all triangles of the graph. New problem: how to enumerate triangles if the graph does not fit in RAM? [Figure: user–movie graph with Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries and La Dolce Vita.]
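For reference, the brute-force version of this task looks as follows. It enumerates every movie pair, which is exactly the cost the slide warns about; the viewer sets are toy data and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two viewer sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def item_similarities(viewers):
    """viewers: dict movie -> set of users. Scores every pair sharing a viewer."""
    sims = {}
    for x, y in combinations(sorted(viewers), 2):   # all pairs: quadratic in #movies
        if viewers[x] & viewers[y]:
            sims[(x, y)] = jaccard(viewers[x], viewers[y])
    return sims


viewers = {
    "City of God": {"u1", "u2", "u3"},
    "La Dolce Vita": {"u2", "u3"},
    "Wild Strawberries": {"u4"},
}
sims = item_similarities(viewers)
print(sims)
```

Only the pair with common viewers gets a score; pairs with no shared viewer are skipped, but they are still visited, which is what makes this approach too slow on large graphs.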
  • 30. Enumerating Triangles (Item-CF) • Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v)) • Iterative memory efficient solution (next slide)
  • 31. Algorithm: • Let pivots be a subset of the vertices; • Load all neighbor-lists (adjacency lists) of pivots into RAM • Now use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots’ adjacency lists (similar to merge). • Repeat with a new subset of pivots.
  • 32. Triangle Counting Performance. Triangle counting on twitter-2010 (1.5B edges): GraphChi (Mac Mini) vs. Hadoop (1636 machines). [Bar chart; x-axis: minutes, 0–450.]
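The pivot scheme can be sketched in memory as below. This is a simplified illustration under stated assumptions: the whole graph is a Python dict rather than on disk, set membership replaces the merge-style comparison of sorted lists, and the function name and `pivot_batch_size` parameter are hypothetical.

```python
def count_triangles(adj, pivot_batch_size):
    """adj: dict vertex -> list of neighbors of an undirected graph."""
    vertices = sorted(adj)
    total = 0
    for start in range(0, len(vertices), pivot_batch_size):
        # Only this batch of adjacency lists is held "in RAM".
        batch = vertices[start:start + pivot_batch_size]
        pivots = {u: set(adj[u]) for u in batch}
        # Stream over all vertices (a disk-based system would read them in order).
        for v in vertices:
            for u in adj[v]:
                if u in pivots and u < v:
                    # Triangles with edge (u, v) = common neighbors of u and v.
                    # Requiring w > v counts each triangle exactly once.
                    total += sum(1 for w in adj[v] if w > v and w in pivots[u])
    return total


# Two triangles: (0, 1, 2) and (0, 2, 3).
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(count_triangles(adj, pivot_batch_size=2))
```

A smaller batch size lowers memory use at the cost of more passes over the graph; the count is the same for any batch size.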
  • 33. FUTURE DIRECTIONS & FINAL REMARKS
  • 34. Single-Machine Computing in Production? • GraphChi supports incremental computation with dynamic graphs: – Can keep on running indefinitely, adding new edges to the graph → constantly fresh model. – However, requires engineering – not included in the toolkit. • Compare to a cluster-based system (such as Hadoop) that needs to compute from scratch.
  • 35. Unified Recsys Platform for GraphChi? • Working with master’s students at CMU. • Goal: ability to easily compare different algorithms, parameters – Unified input, output. – General programmable API (not just file-based) – Evaluation process: Several evaluation metrics; Cross-validation, held-out data… – Run many algorithm instances in parallel, on same graph. – Java. • Scalable from the get-go.
  • 36. [Diagram: input pipeline. DataDescriptor data definition (column1: categorical, column2: real, column3: key, column4: categorical) and input data; Algorithm X: Input; Algorithm Input Descriptor: map(input: DataDescriptor); GraphChi Preprocessor; outputs: GraphChi Input and aux data.]
  • 37. [Diagram: evaluation pipeline. GraphChi Input and aux data on disk; training programs for Algorithm X, Algorithm Y and Algorithm Z, reporting training metrics; Algorithm X Predictor run on held-out data (test data), reporting test quality metrics.]
  • 38. Recent developments: Disk-based Graph Computation • Two disk-based graph computation systems were published recently: – TurboGraph (KDD’13) – X-Stream (SOSP’13 in October) • Significantly better performance than GraphChi on many problems – Avoid preprocessing (“sharding”) – But GraphChi can do some computation that X-Stream cannot (triangle counting and related); TurboGraph requires SSD – Hot research area!
  • 39. Do you need GraphChi – or any system? • Heck, for many algorithms, you can just mmap() over your (binary) adjacency list / sparse matrix, and write a for-loop. – See Lin, Chau, Kang: “Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC” (Big Data ’13) • Obviously good to have a common API – And some algos need more advanced solutions (like GraphChi, X-Stream, TurboGraph) Beware of the hype!
  • 40. Conclusion • Very large recommender algorithms can now be run on just your PC or laptop. – Additional performance from multi-core parallelism. – Great for productivity – scale by replicating. • In general, good single machine scalability requires care with data structures, memory management → natural with C/C++, with Java (etc.) need low-level byte massaging. – Frameworks like GraphChi hide the low-level. • More work needed to “productize” current work.
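A minimal sketch of the "mmap() and a for-loop" idea, using Python's standard `mmap` and `struct` modules. The binary layout is an assumption made for illustration (a flat list of edges, each stored as two little-endian uint32 values, rather than a true adjacency list), and the file is a small temporary one.

```python
import mmap
import os
import struct
import tempfile

# Assumed on-disk format: one (source, destination) pair per edge, two uint32 each.
EDGE = struct.Struct("<II")


def write_edges(path, edges):
    """Write the edge list in the assumed binary format."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def out_degrees(path, num_vertices):
    """One sequential pass over the memory-mapped file; the OS pages data in on demand."""
    degrees = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), EDGE.size):
                src, _dst = EDGE.unpack_from(mm, offset)
                degrees[src] += 1
    return degrees


path = os.path.join(tempfile.mkdtemp(), "edges.bin")
write_edges(path, [(0, 1), (0, 2), (1, 2), (2, 0)])
degrees = out_degrees(path, 3)
print(degrees)
```

The same loop shape works for any computation that needs only one sequential scan per iteration; algorithms that need random access to neighbors are where the more advanced systems named on the slide come in.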
  • 41. Thank you! Aapo Kyrölä Ph.D. candidate @ CMU – soon to graduate! (Currently visiting U.W) http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov

Editor's notes

  1. This talk has two main goals: 1) to challenge a little how we think about scalability: in this case, to show how just a single machine, a Mac Mini, can solve very big problems that people often use something like Hadoop for; 2) to talk about GraphChi, which is my research system, and show how to implement recommender systems on it.
  2. HOW MANY KNOW GRAPHLAB? So because of my industry experience of working with very large systems, I always focus on very practical solutions. And it is because of this experience of working with distributed systems that I really understand the benefits of avoiding them!
  3. Let me ask it the other way round. Why would you want to use a cluster? Most people do not have multi-terabyte or petabyte datasets.
  4. Let me ask it the other way round. Why would you want to use a cluster?
  5. This is a made-up example to illustrate a point. Relate to Netflix off-line. Here we have chosen T to be the time in which the single machine system, such as GraphChi, solves one task. Let’s assume the cluster system needs 6 machines to solve the problem, and does it about 7 times faster than GraphChi. Then in time T it solves 7 tasks, while GraphChi solves 6 tasks with the same six machines. Now if we double the size of the cluster to twelve machines: cluster systems never have linear speedup, so let’s assume the performance increases by, say, 50%. Of course these are just fake numbers, but similar behavior happens at some cut-off point anyway. GraphChi, in contrast, will now solve exactly twice the number of tasks in time T.
  6. We are not the only ones thinking this way… Add MSR paper?
  7. Let’s now discuss the computational setting of this work. Let’s first introduce the basic computational model.
  8. Note about edge-centric?
  9. So as a recap, GraphChi is a disk-based GraphLab. While GraphLab2 is incredibly powerful on big clusters, or in the cloud, you can use GraphChi to solve equally big problems on just a Mac Mini. Of course, GraphLab can solve the problems way faster – but I believe GraphChi provides performance that is more than enough for many. Spin-off of the GraphLab project. Disk-based GraphLab. OSDI’12.
  10. I will now briefly demonstrate why disk-based graph computation was not a trivial problem. Perhaps we can assume it wasn’t, because no such system as stated in the goals clearly existed. But it makes sense to analyze why solving the problem required a small innovation, worthy of an OSDI publication. The main problem has been stated on the slide: random access, i.e. when you need to read many times from many different locations on disk, is slow. This is especially true with hard drives: seek times are several milliseconds. On SSD, random access is much faster, but still a far cry from the performance of RAM. Let’s now study this a bit.
  11. So how does GraphChi work? I don’t have time to go into details now. It is based on an algorithm we invented called Parallel Sliding Windows. In this model you split the graph into P shards, and the graph is processed in P parts. For each part you load one shard completely into memory, and load continuous chunks of data from the other shards. All in all, you need a very small number of random accesses, which are the bottleneck of disk-based computing. GraphChi is good on both SSD and hard drive!
  12. Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale. The industry wants to compute many tasks on the same graph. For example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, etc. Currently you need a cluster just to compute one single task. To compute tasks faster, you grow the cluster. But this work allows a different way. Since one machine can handle one big task, you can dedicate one task per machine. Why does this make sense? * Clusters are complex, and expensive to scale. * In this new model, scaling is very simple as nodes do not talk to each other, and you can double the throughput by doubling the machines. There are other motivations as well, such as reducing costs and energy. But let’s move on.
  13. Single machine systems are easy to program. But currently they need specialized solutions, while if you use Hadoop etc., you can use the same framework for a wide variety of problems.