S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Matrix factorization (MF) underlies many popular algorithms, e.g., collaborative filtering. GPUs, with their massive core counts and high on-chip memory bandwidth, can accelerate MF substantially when their architectural characteristics are exploited appropriately.
In this talk I will introduce cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF problems. CuMF uses a set of techniques to maximize performance on single and multiple GPUs. These techniques include smart access to sparse data that leverages the GPU memory hierarchy, data parallelism used in conjunction with model parallelism, minimized communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.
With a single machine holding four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem reported in the literature to date. We also use cuMF to accelerate the ALS implementation in Spark MLlib.
A paper on CuMF will appear at HPDC 2016; a pre-print is available at http://arxiv.org/abs/1603.03820.


  1. CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs. Wei Tan, IBM T. J. Watson Research Center, wtan@us.ibm.com
  2. Agenda
     • Why accelerate recommendation (matrix factorization) using GPUs?
     • What are the challenges?
     • How does cuMF tackle the challenges?
     • So what is the result?
  3. Why do we need a fast and scalable recommender system?
     • Recommendation is pervasive
     • Drives 80% of Netflix's watch hours [1]
     • The US digital ads market is worth $37.3 billion [2]
     • Need: fast, scalable, economical
  4. ALS (alternating least squares) for collaborative filtering
     • Input: users' ratings on some items
     • Output: all missing ratings
     • How: factorize the rating matrix R into X ΘT and minimize the loss on the observed ratings
     • ALS: iteratively solve for X (user factors xu) and Θ (item factors θv)
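In LaTeX, the objective this slide implies is the following, with the usual quadratic regularizer (the λ term is standard ALS practice and an assumption here, since the slide does not show it):

```latex
\min_{X,\Theta}\;\sum_{(u,v)\in\Omega}\bigl(r_{uv}-x_u^{\top}\theta_v\bigr)^2
  +\lambda\Bigl(\sum_u \lVert x_u\rVert^2+\sum_v \lVert\theta_v\rVert^2\Bigr)
```

With Θ fixed the problem is quadratic in each xu (and vice versa), so every outer ALS iteration reduces to many small least-squares solves.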
  5. Matrix factorization/ALS is versatile
     • Recommendation: factorize the user × item rating matrix as X ΘT
     • Word embedding: factorize the word × word matrix as X ΘT
     • Topic modeling: factorize the word × document matrix as X ΘT
     • Hence a very important algorithm to accelerate
  6. Challenges of fast and scalable ALS
     • ALS needs to solve, per user/item, the normal equation shown below
     • spmm: cublas; LU or Cholesky decomposition: cublas
     • Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory-intensive
     • Challenge 2: a single GPU cannot handle big m, n and nnz
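For reference, the per-user system the slide alludes to is the standard ALS normal equation (again with an assumed λI ridge term):

```latex
\Bigl(\sum_{v\in\Omega_u}\theta_v\theta_v^{\top}+\lambda I\Bigr)x_u
  =\sum_{v\in\Omega_u} r_{uv}\,\theta_v
```

Aggregating the θv θvT terms over the sparse, irregular set Ωu is Challenge 1; the resulting f × f solve is small and dense, so LU or Cholesky through cuBLAS suffices.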
  7. Challenge 1: memory access
     • Nvidia K40: memory BW 288 GB/sec, compute 4.3 Tflops/sec
     • Higher flops requires higher operational intensity (more flops per byte), i.e., caching!
     (Roofline figure: Gflops/s vs. operational intensity in flops/byte; below the ridge point of about 15 flops/byte the GPU is under-utilized, above it fully utilized.)
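The ridge point of the roofline follows directly from the two hardware numbers on this slide:

```latex
\frac{4.3\ \text{Tflop/s}}{288\ \text{GB/s}}\approx 15\ \text{flop/byte}
```

Kernels below roughly 15 flops per byte are bandwidth-bound on a K40, which is why the memory optimizations on the next slides matter.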
  8. Address challenge 1: memory-optimized ALS
     • To obtain the aggregated θv θvT terms of the normal equation (a CUDA sketch follows this slide):
       1. Reuse θvs for many users
       2. Load a portion into shared memory (smem)
       3. Tile and aggregate in registers
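A minimal CUDA sketch of steps 2 and 3 above, assuming rank F = 64, one thread block per user and CSR-indexed ratings; it is illustrative only and not the cuMF kernel — in particular, the cross-user reuse of step 1, the right-hand side and the regularization are omitted:

```cuda
#define F    64   // factor rank (assumption for this sketch)
#define TILE 4    // each thread keeps a TILE x TILE block of the F x F matrix in registers

// One thread block per user u. The block accumulates A_u = sum_{v in R(u)} theta_v * theta_v^T,
// staging each theta_v through shared memory (step 2) and accumulating in registers (step 3).
// Launch with (F/TILE)*(F/TILE) = 256 threads per block and one block per user.
__global__ void gram_per_user(const int   *csr_row,   // CSR row pointers of R (length m+1)
                              const int   *csr_col,   // item ids rated by each user
                              const float *theta,     // item factors, F floats per item
                              float       *A)         // output: m dense F x F matrices
{
    __shared__ float sh_theta[F];                     // one theta_v at a time (step 2)
    const int u    = blockIdx.x;
    const int tid  = threadIdx.x;
    const int row0 = (tid / (F / TILE)) * TILE;       // top-left corner of this thread's tile
    const int col0 = (tid % (F / TILE)) * TILE;
    float acc[TILE][TILE] = {{0.f}};                  // register tile (step 3)

    for (int idx = csr_row[u]; idx < csr_row[u + 1]; ++idx) {
        const int v = csr_col[idx];
        for (int i = tid; i < F; i += blockDim.x)     // cooperative load, read by all threads
            sh_theta[i] = theta[v * F + i];
        __syncthreads();
        for (int i = 0; i < TILE; ++i)                // rank-1 update of the register tile
            for (int j = 0; j < TILE; ++j)
                acc[i][j] += sh_theta[row0 + i] * sh_theta[col0 + j];
        __syncthreads();                              // protect sh_theta before the next load
    }
    for (int i = 0; i < TILE; ++i)                    // write this thread's tile of A_u
        for (int j = 0; j < TILE; ++j)
            A[(size_t)u * F * F + (size_t)(row0 + i) * F + (col0 + j)] = acc[i][j];
}
```

Each thread keeps a 4 × 4 tile of the F × F matrix in registers, while every θv is staged once through shared memory and reused by all 256 threads of the block.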
  9. Address challenge 1: memory-optimized ALS (continued)
     (Figure: the f × f aggregation is split into T × T tiles held in registers.)
  10. Address challenge 2: scale up ALS on multiple GPUs
      • Model parallel: each GPU solves a portion of the model
  11. Address challenge 2: scale up ALS on multiple GPUs (continued)
      • Data parallel: each GPU solves with a portion of the training data
      • Used in conjunction with model parallelism
  12. Address challenge 2: parallel reduction
      • Data parallelism needs cross-GPU reduction
      • One-phase parallel reduction vs. two-phase, topology-aware parallel reduction: reduce within a socket first (intra-socket), then across sockets (inter-socket); see the sketch below
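A host-side CUDA sketch of the two-phase idea on 4 GPUs; the GPU-to-socket mapping, buffer handling and function names are assumptions for illustration, not cuMF's actual reduction code:

```cuda
#include <cuda_runtime.h>

__global__ void add_inplace(float *dst, const float *src, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

// buf[g]: GPU g's partial sum of the model; tmp[g]: scratch buffer of the same size on GPU g.
// Assumed topology: GPUs 0,1 share one socket, GPUs 2,3 the other (the real scheme would
// query the PCIe topology instead of hard-coding this mapping).
void two_phase_reduce(float *buf[4], float *tmp[4], size_t n) {
    const size_t bytes = n * sizeof(float);
    dim3 block(256), grid((unsigned)((n + 255) / 256));

    // Phase 1: intra-socket reductions (GPU 1 -> GPU 0, GPU 3 -> GPU 2)
    const int pairs[2][2] = {{0, 1}, {2, 3}};
    for (int p = 0; p < 2; ++p) {
        const int dst = pairs[p][0], src = pairs[p][1];
        cudaMemcpyPeer(tmp[dst], dst, buf[src], src, bytes);  // fast local link
        cudaSetDevice(dst);
        add_inplace<<<grid, block>>>(buf[dst], tmp[dst], n);
        cudaDeviceSynchronize();
    }

    // Phase 2: a single inter-socket transfer (GPU 2 -> GPU 0)
    cudaMemcpyPeer(tmp[0], 0, buf[2], 2, bytes);
    cudaSetDevice(0);
    add_inplace<<<grid, block>>>(buf[0], tmp[0], n);
    cudaDeviceSynchronize();
    // buf[0] now holds the full sum; broadcasting it back to the other GPUs is omitted.
}
```

Compared with a one-phase reduction in which every GPU sends its partial sum directly to GPU 0, only one transfer here crosses the slower inter-socket link.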
  13. Recap: cuMF tackled two challenges
      • ALS needs to solve, per user/item, the normal equation from slide 6
      • spmm: cublas; LU or Cholesky decomposition: cublas
      • Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory-intensive
      • Challenge 2: a single GPU cannot handle big m, n and nnz
  14. Connect cuMF to Spark MLlib
      • Spark applications relying on mllib/ALS need no change
      • A modified mllib/ALS detects GPUs and offloads computation
      • Leverage the best of Spark (scale-out) and GPU (scale-up)
      (Stack: ALS apps → mllib/ALS → JNI → cuMF; a hypothetical JNI shim is sketched below.)
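A sketch of what such a JNI bridge can look like in C; the Java class name (com.example.CuMFJNI), the method signature and the cumf_als_solve() entry point are all made up for illustration and are not the actual cuMF binding:

```c
/* Hypothetical JNI shim: a modified mllib/ALS hands one partition of ratings
 * to a native GPU solver and receives the updated factors back. */
#include <jni.h>

/* assumed native solver entry point (implemented in CUDA elsewhere) */
void cumf_als_solve(const int *user_idx, const int *item_idx, const float *rating,
                    int nnz, int rank, float *user_factors, float *item_factors);

JNIEXPORT void JNICALL
Java_com_example_CuMFJNI_solve(JNIEnv *env, jobject self,
                               jintArray userIdx, jintArray itemIdx, jfloatArray rating,
                               jint rank, jfloatArray userFactors, jfloatArray itemFactors)
{
    jint    nnz = (*env)->GetArrayLength(env, rating);
    jint   *u   = (*env)->GetIntArrayElements(env, userIdx, NULL);
    jint   *v   = (*env)->GetIntArrayElements(env, itemIdx, NULL);
    jfloat *r   = (*env)->GetFloatArrayElements(env, rating, NULL);
    jfloat *x   = (*env)->GetFloatArrayElements(env, userFactors, NULL);
    jfloat *t   = (*env)->GetFloatArrayElements(env, itemFactors, NULL);

    cumf_als_solve((const int *)u, (const int *)v, (const float *)r,
                   (int)nnz, (int)rank, x, t);   /* form and solve on the GPU */

    /* ratings are read-only (JNI_ABORT); factor arrays are copied back (mode 0) */
    (*env)->ReleaseIntArrayElements(env, userIdx, u, JNI_ABORT);
    (*env)->ReleaseIntArrayElements(env, itemIdx, v, JNI_ABORT);
    (*env)->ReleaseFloatArrayElements(env, rating, r, JNI_ABORT);
    (*env)->ReleaseFloatArrayElements(env, userFactors, x, 0);
    (*env)->ReleaseFloatArrayElements(env, itemFactors, t, 0);
}
```

The Scala side would declare a matching native method and load the shared library; that side is omitted here.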
  15. Connect cuMF to Spark MLlib (continued)
      (Figure: Power 8 nodes, each with 2 K40s; RDDs on the CPU feed CUDA kernels on GPU1 and GPU2, with shuffles between nodes.)
      • RDD on CPU: to distribute rating data and shuffle parameters
      • Solver on GPU: to form and solve
      • Able to run on multiple nodes, and multiple GPUs per node
  16. Implementation
      • In C (circa 10k LOC)
      • CUDA 7.0/7.5, GCC, OpenMP v3.0
      • Baselines:
        – Libmf: SGD on 1 node [RecSys14]
        – NOMAD: SGD on >1 node [VLDB14]
        – SparkALS: ALS on Spark
        – FactorBird: SGD + parameter server for MF
        – Facebook: enhanced Giraph
  17. CuMF performance
      (Figure: X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on the test set.)
      • 1 GPU vs. 30 cores: CuMF is slightly faster than libmf and NOMAD
      • CuMF scales well on 1, 2 and 4 GPUs
  18. Effectiveness of memory optimization
      (Figure: X-axis: time in seconds; Y-axis: RMSE on the test set.)
      • Aggressively using registers → 2x faster
      • Using texture memory → 25%-35% faster
  19. CuMF performance and cost
      • CuMF @4 GPUs ≈ NOMAD @64 HPC nodes ≈ 10x NOMAD @32 AWS nodes
      • CuMF @4 GPUs ≈ 10x SparkALS @50 nodes, at ≈1% of its cost
  20. CuMF accelerated Spark on Power 8
      • CuMF @2 K40s achieves a 6+x speedup in training (193 sec vs. 1.3k sec)
      * GUI designed by Amir Sanjar
  21. Conclusion
      • Why accelerate recommendation (matrix factorization) using GPUs?
        – It needs to be fast, scalable and economical
      • What are the challenges?
        – Memory access; scaling to multiple GPUs
      • How does cuMF tackle the challenges?
        – It optimizes memory access, parallelism and communication
      • So what is the result?
        – Up to 10x as fast and 100x as cost-efficient
        – Use cuMF standalone or with Spark
        – GPUs can tackle ML problems beyond deep learning!
  22. Thank you, questions?
      Wei Tan, IBM T. J. Watson Research Center, wtan@us.ibm.com, http://ibm.biz/wei_tan
      Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016, http://arxiv.org/abs/1603.03820
      Source code available soon.
