SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
A NumPy-compatible Library for GPU
Shohei Hido
VP of Research
Preferred Networks
Preferred Networks: An AI Startup in Japan
● Founded: March 2014 (120 engineers and researchers)
● Major news
● $100+M investment from Toyota for autonomous driving
● 2nd place at Amazon Robotics Challenge 2016
● Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs)
2
Deep learning research Industrial applications
Manufacturing
Automotive
Healthcare
Key takeaways
● CuPy is an open-source NumPy for NVIDIA GPU
● Python users can easily write CPU/GPU-agnostic code
● Existing NumPy code can be accelerated thanks to GPU and CUDA libraries
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
CuPy: A NumPy-Compatible Library for NVIDIA GPU
● NumPy is extensively used in Python but GPU is not supported
● GPU is getting faster and more important for scientific computing
import numpy as np
x_cpu = np.random.rand(10)
W_cpu = np.random.rand(10, 5)
y_cpu = np.dot(x_cpu, W_cpu)
import cupy as cp
x_gpu = cp.random.rand(10)
W_gpu = cp.random.rand(10, 5)
y_gpu = cp.dot(x_gpu, W_gpu)
y_gpu = cp.asarray(y_cpu)
y_cpu = cp.asnumpy(y_gpu)
for xp in [numpy, cupy]:
x = xp.random.rand(10)
W = xp.random.rand(10, 5)
y = xp.dot(x, W)
CPU/GPU-agnostic
NVIDIA GPUCPU
CuPy is actively developed (1,600+ github stars, 11,000+ commits)
Ryosuke Okuta
CTO
Preferred
Networks
Deep learning framework
https://chainer.org/
Probabilistic and graphical modeling
https://github.com/jmschrei/pomegranate
Natural language processing
https://spacy.io/
Python libraries powered by CuPy
Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy
Reputation (2/2): Stephan Merity of Salesforce Research (MetaMind)
Our mission: make CuPy the default tool for GPU computation in Python
https://anaconda.org/anaconda/cupy/
● CuPy is now available on Anaconda in collaboration w/ Anaconda team
● You can install cupy with “$ conda install cupy” on Linux 64-bit
● We are working on Windows version
Don’t have GPU for CuPy? Google Colaboratory gives you one (for free!)
…
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
Implementation of CPU/GPU agnostic k-means fit(): 37 lines
https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py
K-means (1/3): Call function and initialization
● fit() follows the training API of scikit-learn
● xp represents either numpy or cupy
● Cluster centers are initialized by positions of
random samples
<- Specify NumPy or CuPy
K-means (2/3): Calculate distance to all of the cluster centers
● xp.linalg.norm is to compute the distance and
supported both in numpy and cupy
● _fit_calc_distances() uses custom kernel on cupy
Customized kernel with C++ snippet in cupy.ElementwiseKernel
● A kernel is generated by element-wise operation defined in C++ snippet
K-means (3/3): Update positions of cluster centers
● xp.stack is to update the cluster centers and
supported both in numpy and cupy
● _fit_calc_center() is also custom kernel based
Another element-wise kernel
● It just adds all of the points inside each cluster and count the number
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
Performance comparison with NumPy
● CuPy is faster than NumPy even in simple manipulation of large matrix
Benchmark code
Size CuPy [ms] NumPy [ms]
10^4 0.58 0.03
10^5 0.97 0.20
10^6 1.84 2.00
10^7 12.48 55.55
10^8 84.73 517.17
Benchmark result
6x faster
● Data types (dtypes)
○ bool_, int8, int16, int32, int64, uint8, uint16,
uint32, uint64, float16, float32, float64,
complex64, and complex128
● All basic indexing
○ indexing by ints, slices, newaxes, and Ellipsis
● Most of advanced indexing
○ except indexing patterns with boolean
masks
● Most of the array creation routines
○ empty, ones_like, diag, etc...
● Most of the array manipulation routines
○ reshape, rollaxis, concatenate, etc...
● All operators with broadcasting
● All universal functions for element-wise
operations
○ except those for complex numbers
● Linear algebra functions accelerated by cuBLAS
○ including product: dot, matmul, etc...
○ including decomposition: cholesky, svd,
etc...
● Reduction along axes
○ sum, max, argmax, etc...
● Sort operations implemented by Thrust
○ sort, argsort, and lexsort
● Sparse matrix accelerated by cuSPARSE
Compatibility with NumPy
Comparison with other Python libraries for/on CUDA
● CuPy is the only library that is designed for high compatibility with NumPy
still allowing users to write customized CUDA kernels for better performance
CuPy PyCUDA MinPy*
NVIDIA CUDA support ✔ ✔ ✔
CPU/GPU agnostic coding ✔ ✔
Automatic gradient support ** ✔
NumPy compatible interface ✔ ✔
User-defined CUDA kernel ✔ ✔
* https://github.com/dmlc/minpy
** Autograd is supported by Chainer
Inside CuPy
● CuPy extensively relies on NVIDIA libraries for better performance
Linear algebra
NVIDIA GPU
CUDA
cuDNN cuBLAS cuRANDcuSPARSE
NCCL
Thrust
Sparse matrix
DNN
Utility
Random
numbers
cuSOLVER
User-
defined
CUDA
kernel
Multi-
GPU
data
transfer
Sort
CuPy
Looks very easy?
● CUDA and its libraries are not designed for Python nor NumPy
━ CuPy is not just a wrapper of CUDA libraries for Python
━ CuPy is a fast numerical computation library on GPU with NumPy-compatible API
● NumPy specification is not documented
━ We have carefully investigated some unexpected behaviors of NumPy
━ CuPy tries to replicate NumPy’s behavior as much as possible
● NumPy’s behaviors vary between different versions
━ e.g, NumPy v1.14 changed the output format of __str__
• `[ 0. 1.]` -> `[0. 1.]` (no space)
Advanced features of CuPy (1/2)
Memory pool GPU Memory profiler
Function name
Used
Bytes
Acquired
Bytes
Occurrence
LinearFunction 5.16GB 0.18GB 3900
ReLU 0.99GB 0.46GB 1300
SoftMaxEnropy 7.71MB 5.08MB 1300
Accuracy 0.62MB 0.35MB 700
● This enables function-wise memory
profiling on Chainer
● Avoiding cudaMalloc is a
common practice in CUDA
programming
● CuPy supports memory pooling
using Best-Fit with Coalescing
(BFC) algorithm
● It reduces memory usage
to 25% on seq2seq model
Advanced features of CuPy (2/2)
Kernel fusion (experimental)
@cp.fuse()
def fused_func(x, y, z):
return (x * y) + z
● By adding decorator @cp.fuse(),
CuPy stores a series of operations
● Then it compiles a single kernel
to execute the operations
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
• Start providing pre-built wheel packages of CuPy
– cupy-cuda80, cupy-cuda90, and cupy-cuda91
– $ pip install cupy-cuda80
• Memory pool is now the default allocator
– Added line memory profiler using memory hook and traceback
• CUDA stream is fully supported
stream = cupy.cuda.stream.Stream()
with stream:
y = cupy.linalg.norm(x)
stream.synchronize()
stream = cupy.cuda.stream.Stream()
stream.use()
y = cupy.linalg.norm(x)
stream.synchronize()
What’s new in CuPy v4?
cupy.argpartition
cupy.unravel_index
cupy.percentile
cupy.moveaxis
cupy.blackman
cupy.hamming
cupy.hanning
cupy.isclose
cupy.iscomplex
cupy.iscomplexobj
cupy.isfortran
cupy.isreal
cupy.isrealobj
cupy.linalg.tensorinv
cupy.random.shuffle
cupy.random.set_random_state
cupy.random.RandomState.tomaxint
cupy.sparse.random
cupy.sparse.csr_matrix.eliminate_zeros
cupy.sparse.coo_matrix.eliminate_zeros
cupy.sparse.csc_matrix.eliminate_zeros
cupyx.scatter_add
cupy.fft
Standard FFTs:
fft, ifft, fft2, ifft2, fftn, ifftn
Real FFTs:
rfft, irfft, rfft2, irfft2., rfftn, irfftn
Hermitian FFTs:
hfft, ihfft
Helper routines:
fftfreq, rfftfreq, fftshift, ifftshift
Newly added functions in v4
• Windows support
• AMD GPU support via HIP
• More useful fusion function
• Add more functions (NumPy, SciPy)
• Add more probability distributions
• Provide simple CUDA kernel
• Support DLPack and
TensorComprehension
– toDLPack() and fromDLPack()
@cupy.fuse()
def sample2(x, y):
return cupy.sum(x + y, axis=0) * 2
CuPy v5 - planned features
Summary: CuPy is a drop-in replacement of NumPy for GPU
1. Highly-compatible with NumPy
━ data types, indexing, broadcasting, operations
━ Users can write CPU/GPU-agnostic code
2. High performance on NVIDIA GPUs
━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL
3. Easy to install
━ $ pip install cupy
━ $ conda install cupy
4. Easy to write custom kernel
━ ElementwiseKernel, ReductionKernel
import numpy as np
x = np.random.rand(10)
W = np.random.rand(10, 5)
y = np.dot(x, W)
import cupy as cp
x = cp.random.rand(10)
W = cp.random.rand(10, 5)
y = cp.dot(x, W)
to
GPU to
CPU
Your contribution will be highly appreciated & We are hiring!

Más contenido relacionado

La actualidad más candente

Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Preferred Networks
 
CUDA 6の話@関西GPGPU勉強会#5
CUDA 6の話@関西GPGPU勉強会#5CUDA 6の話@関西GPGPU勉強会#5
CUDA 6の話@関西GPGPU勉強会#5Yosuke Onoue
 
JupyterLabを中心とした快適な分析生活
JupyterLabを中心とした快適な分析生活JupyterLabを中心とした快適な分析生活
JupyterLabを中心とした快適な分析生活Classi.corp
 
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-clusterKubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-clusterPreferred Networks
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...ScaleGrid.io
 
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...Preferred Networks
 
研究者のための Python による FPGA 入門
研究者のための Python による FPGA 入門研究者のための Python による FPGA 入門
研究者のための Python による FPGA 入門ryos36
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesOhyama Masanori
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門NVIDIA Japan
 
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based NodesBenefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based Nodesinside-BigData.com
 
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニング
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニングNGC でインフラ環境整備の時間短縮!素早く始めるディープラーニング
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニングNVIDIA Japan
 
AllwinnerタブレットのOSを作ってみる(中編)
AllwinnerタブレットのOSを作ってみる(中編)AllwinnerタブレットのOSを作ってみる(中編)
AllwinnerタブレットのOSを作ってみる(中編)shimadah
 
徳島生まれのオープンソース ソフトウェアシラサギのCMS導入事例、機能紹介
徳島生まれのオープンソースソフトウェアシラサギのCMS導入事例、機能紹介徳島生まれのオープンソースソフトウェアシラサギのCMS導入事例、機能紹介
徳島生まれのオープンソース ソフトウェアシラサギのCMS導入事例、機能紹介okusakazuya
 
気候モデル放射カーネルのGPUへの移植と高速化
気候モデル放射カーネルのGPUへの移植と高速化気候モデル放射カーネルのGPUへの移植と高速化
気候モデル放射カーネルのGPUへの移植と高速化Takateru Yamagishi
 
Polyphony: Python ではじめる FPGA
Polyphony: Python ではじめる FPGAPolyphony: Python ではじめる FPGA
Polyphony: Python ではじめる FPGAryos36
 
OpenStreetMap初級編
OpenStreetMap初級編OpenStreetMap初級編
OpenStreetMap初級編義治 蛭本
 
RustによるGPUプログラミング環境
RustによるGPUプログラミング環境RustによるGPUプログラミング環境
RustによるGPUプログラミング環境KiyotomoHiroyasu
 
1072: アプリケーション開発を加速するCUDAライブラリ
1072: アプリケーション開発を加速するCUDAライブラリ1072: アプリケーション開発を加速するCUDAライブラリ
1072: アプリケーション開発を加速するCUDAライブラリNVIDIA Japan
 
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)智啓 出川
 
ソフトウェア技術者はFPGAをどのように使うか
ソフトウェア技術者はFPGAをどのように使うかソフトウェア技術者はFPGAをどのように使うか
ソフトウェア技術者はFPGAをどのように使うかなおき きしだ
 

La actualidad más candente (20)

Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
 
CUDA 6の話@関西GPGPU勉強会#5
CUDA 6の話@関西GPGPU勉強会#5CUDA 6の話@関西GPGPU勉強会#5
CUDA 6の話@関西GPGPU勉強会#5
 
JupyterLabを中心とした快適な分析生活
JupyterLabを中心とした快適な分析生活JupyterLabを中心とした快適な分析生活
JupyterLabを中心とした快適な分析生活
 
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-clusterKubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
 
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
 
研究者のための Python による FPGA 入門
研究者のための Python による FPGA 入門研究者のための Python による FPGA 入門
研究者のための Python による FPGA 入門
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on Kubernetes
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門
 
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based NodesBenefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
 
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニング
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニングNGC でインフラ環境整備の時間短縮!素早く始めるディープラーニング
NGC でインフラ環境整備の時間短縮!素早く始めるディープラーニング
 
AllwinnerタブレットのOSを作ってみる(中編)
AllwinnerタブレットのOSを作ってみる(中編)AllwinnerタブレットのOSを作ってみる(中編)
AllwinnerタブレットのOSを作ってみる(中編)
 
徳島生まれのオープンソース ソフトウェアシラサギのCMS導入事例、機能紹介
徳島生まれのオープンソースソフトウェアシラサギのCMS導入事例、機能紹介徳島生まれのオープンソースソフトウェアシラサギのCMS導入事例、機能紹介
徳島生まれのオープンソース ソフトウェアシラサギのCMS導入事例、機能紹介
 
気候モデル放射カーネルのGPUへの移植と高速化
気候モデル放射カーネルのGPUへの移植と高速化気候モデル放射カーネルのGPUへの移植と高速化
気候モデル放射カーネルのGPUへの移植と高速化
 
Polyphony: Python ではじめる FPGA
Polyphony: Python ではじめる FPGAPolyphony: Python ではじめる FPGA
Polyphony: Python ではじめる FPGA
 
OpenStreetMap初級編
OpenStreetMap初級編OpenStreetMap初級編
OpenStreetMap初級編
 
RustによるGPUプログラミング環境
RustによるGPUプログラミング環境RustによるGPUプログラミング環境
RustによるGPUプログラミング環境
 
1072: アプリケーション開発を加速するCUDAライブラリ
1072: アプリケーション開発を加速するCUDAライブラリ1072: アプリケーション開発を加速するCUDAライブラリ
1072: アプリケーション開発を加速するCUDAライブラリ
 
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)
2015年度GPGPU実践プログラミング 第4回 GPUでの並列プログラミング(ベクトル和,移動平均,差分法)
 
ソフトウェア技術者はFPGAをどのように使うか
ソフトウェア技術者はFPGAをどのように使うかソフトウェア技術者はFPGAをどのように使うか
ソフトウェア技術者はFPGAをどのように使うか
 

Similar a CuPy: A NumPy-compatible Library for GPU

Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computingAshwin Ashok
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksKenta Oono
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...KubeAcademy
 
KURMA - A Containerized Container Platform - KubeCon 2016
KURMA - A Containerized Container Platform - KubeCon 2016KURMA - A Containerized Container Platform - KubeCon 2016
KURMA - A Containerized Container Platform - KubeCon 2016Apcera
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...Ian Lumb
 
CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019NVIDIA
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014Puppet
 
Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)IT-Доминанта
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 

Similar a CuPy: A NumPy-compatible Library for GPU (20)

Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computing
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning Frameworks
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...
KubeCon EU 2016: Bringing an open source Containerized Container Platform to ...
 
KURMA - A Containerized Container Platform - KubeCon 2016
KURMA - A Containerized Container Platform - KubeCon 2016KURMA - A Containerized Container Platform - KubeCon 2016
KURMA - A Containerized Container Platform - KubeCon 2016
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...
Univa and SUSE at SC17: Scaling Machine Learning for SUSE Linux Containers, S...
 
CUDA
CUDACUDA
CUDA
 
CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014
 
Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 

Más de Shohei Hido

Deep Learning Lab 異常検知入門
Deep Learning Lab 異常検知入門Deep Learning Lab 異常検知入門
Deep Learning Lab 異常検知入門Shohei Hido
 
ディープラーニングの産業応用とそれを支える技術
ディープラーニングの産業応用とそれを支える技術ディープラーニングの産業応用とそれを支える技術
ディープラーニングの産業応用とそれを支える技術Shohei Hido
 
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFAShohei Hido
 
Software for Edge Heavy Computing @ INTEROP 2016 Tokyo
Software for Edge Heavy Computing @ INTEROP 2016 TokyoSoftware for Edge Heavy Computing @ INTEROP 2016 Tokyo
Software for Edge Heavy Computing @ INTEROP 2016 TokyoShohei Hido
 
Chainer GTC 2016
Chainer GTC 2016Chainer GTC 2016
Chainer GTC 2016Shohei Hido
 
How AI revolutionizes robotics and automotive industries
How AI revolutionizes robotics and automotive industriesHow AI revolutionizes robotics and automotive industries
How AI revolutionizes robotics and automotive industriesShohei Hido
 
NIPS2015概要資料
NIPS2015概要資料NIPS2015概要資料
NIPS2015概要資料Shohei Hido
 
プロダクトマネージャのお仕事
プロダクトマネージャのお仕事プロダクトマネージャのお仕事
プロダクトマネージャのお仕事Shohei Hido
 
あなたの業務に機械学習を活用する5つのポイント
あなたの業務に機械学習を活用する5つのポイントあなたの業務に機械学習を活用する5つのポイント
あなたの業務に機械学習を活用する5つのポイントShohei Hido
 
PFIセミナー "「失敗の本質」を読む"発表資料
PFIセミナー "「失敗の本質」を読む"発表資料PFIセミナー "「失敗の本質」を読む"発表資料
PFIセミナー "「失敗の本質」を読む"発表資料Shohei Hido
 
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...Shohei Hido
 
機械学習CROSS 後半資料
機械学習CROSS 後半資料機械学習CROSS 後半資料
機械学習CROSS 後半資料Shohei Hido
 
機械学習CROSS 前半資料
機械学習CROSS 前半資料機械学習CROSS 前半資料
機械学習CROSS 前半資料Shohei Hido
 
Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Shohei Hido
 
Jubatusが目指すインテリジェンス基盤
Jubatusが目指すインテリジェンス基盤Jubatusが目指すインテリジェンス基盤
Jubatusが目指すインテリジェンス基盤Shohei Hido
 
今年のKDDベストペーパーを実装・公開しました
今年のKDDベストペーパーを実装・公開しました今年のKDDベストペーパーを実装・公開しました
今年のKDDベストペーパーを実装・公開しましたShohei Hido
 
さらば!データサイエンティスト
さらば!データサイエンティストさらば!データサイエンティスト
さらば!データサイエンティストShohei Hido
 
ICML2013読み会 開会宣言
ICML2013読み会 開会宣言ICML2013読み会 開会宣言
ICML2013読み会 開会宣言Shohei Hido
 
ビッグデータはどこまで効率化できるか?
ビッグデータはどこまで効率化できるか?ビッグデータはどこまで効率化できるか?
ビッグデータはどこまで効率化できるか?Shohei Hido
 

Más de Shohei Hido (20)

Deep Learning Lab 異常検知入門
Deep Learning Lab 異常検知入門Deep Learning Lab 異常検知入門
Deep Learning Lab 異常検知入門
 
NIPS2017概要
NIPS2017概要NIPS2017概要
NIPS2017概要
 
ディープラーニングの産業応用とそれを支える技術
ディープラーニングの産業応用とそれを支える技術ディープラーニングの産業応用とそれを支える技術
ディープラーニングの産業応用とそれを支える技術
 
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA
機械学習モデルフォーマットの話:さようならPMML、こんにちはPFA
 
Software for Edge Heavy Computing @ INTEROP 2016 Tokyo
Software for Edge Heavy Computing @ INTEROP 2016 TokyoSoftware for Edge Heavy Computing @ INTEROP 2016 Tokyo
Software for Edge Heavy Computing @ INTEROP 2016 Tokyo
 
Chainer GTC 2016
Chainer GTC 2016Chainer GTC 2016
Chainer GTC 2016
 
How AI revolutionizes robotics and automotive industries
How AI revolutionizes robotics and automotive industriesHow AI revolutionizes robotics and automotive industries
How AI revolutionizes robotics and automotive industries
 
NIPS2015概要資料
NIPS2015概要資料NIPS2015概要資料
NIPS2015概要資料
 
プロダクトマネージャのお仕事
プロダクトマネージャのお仕事プロダクトマネージャのお仕事
プロダクトマネージャのお仕事
 
あなたの業務に機械学習を活用する5つのポイント
あなたの業務に機械学習を活用する5つのポイントあなたの業務に機械学習を活用する5つのポイント
あなたの業務に機械学習を活用する5つのポイント
 
PFIセミナー "「失敗の本質」を読む"発表資料
PFIセミナー "「失敗の本質」を読む"発表資料PFIセミナー "「失敗の本質」を読む"発表資料
PFIセミナー "「失敗の本質」を読む"発表資料
 
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
NIPS2013読み会: More Effective Distributed ML via a Stale Synchronous Parallel P...
 
機械学習CROSS 後半資料
機械学習CROSS 後半資料機械学習CROSS 後半資料
機械学習CROSS 後半資料
 
機械学習CROSS 前半資料
機械学習CROSS 前半資料機械学習CROSS 前半資料
機械学習CROSS 前半資料
 
Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門
 
Jubatusが目指すインテリジェンス基盤
Jubatusが目指すインテリジェンス基盤Jubatusが目指すインテリジェンス基盤
Jubatusが目指すインテリジェンス基盤
 
今年のKDDベストペーパーを実装・公開しました
今年のKDDベストペーパーを実装・公開しました今年のKDDベストペーパーを実装・公開しました
今年のKDDベストペーパーを実装・公開しました
 
さらば!データサイエンティスト
さらば!データサイエンティストさらば!データサイエンティスト
さらば!データサイエンティスト
 
ICML2013読み会 開会宣言
ICML2013読み会 開会宣言ICML2013読み会 開会宣言
ICML2013読み会 開会宣言
 
ビッグデータはどこまで効率化できるか?
ビッグデータはどこまで効率化できるか?ビッグデータはどこまで効率化できるか?
ビッグデータはどこまで効率化できるか?
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Último (20)

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

CuPy: A NumPy-compatible Library for GPU

  • 1. A NumPy-compatible Library for GPU Shohei Hido VP of Research Preferred Networks
  • 2. Preferred Networks: An AI Startup in Japan ● Founded: March 2014 (120 engineers and researchers) ● Major news ● $100+M investment from Toyota for autonomous driving ● 2nd place at Amazon Robotics Challenge 2016 ● Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs) 2 Deep learning research Industrial applications Manufacturing Automotive Healthcare
  • 3. Key takeaways ● CuPy is an open-source NumPy for NVIDIA GPU ● Python users can easily write CPU/GPU-agnostic code ● Existing NumPy code can be accelerated thanks to GPU and CUDA libraries
  • 4. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 5. CuPy: A NumPy-Compatible Library for NVIDIA GPU ● NumPy is extensively used in Python but GPU is not supported ● GPU is getting faster and more important for scientific computing import numpy as np x_cpu = np.random.rand(10) W_cpu = np.random.rand(10, 5) y_cpu = np.dot(x_cpu, W_cpu) import cupy as cp x_gpu = cp.random.rand(10) W_gpu = cp.random.rand(10, 5) y_gpu = cp.dot(x_gpu, W_gpu) y_gpu = cp.asarray(y_cpu) y_cpu = cp.asnumpy(y_gpu) for xp in [numpy, cupy]: x = xp.random.rand(10) W = xp.random.rand(10, 5) y = xp.dot(x, W) CPU/GPU-agnostic NVIDIA GPUCPU
  • 6. CuPy is actively developed (1,600+ github stars, 11,000+ commits) Ryosuke Okuta CTO Preferred Networks
  • 7. Deep learning framework https://chainer.org/ Probabilistic and graphical modeling https://github.com/jmschrei/pomegranate Natural language processing https://spacy.io/ Python libraries powered by CuPy
  • 8. Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy
  • 9. Reputation (2/2): Stephan Merity of Salesforce Research (MetaMind)
  • 10. Our mission: make CuPy the default tool for GPU computation in Python https://anaconda.org/anaconda/cupy/ ● CuPy is now available on Anaconda in collaboration w/ Anaconda team ● You can install cupy with “$ conda install cupy” on Linux 64-bit ● We are working on Windows version
  • 11. Don’t have GPU for CuPy? Google Colaboratory gives you one (for free!) …
  • 12. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 13. Implementation of CPU/GPU agnostic k-means fit(): 37 lines https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py
  • 14. K-means (1/3): Call function and initialization ● fit() follows the training API of scikit-learn ● xp represents either numpy or cupy ● Cluster centers are initialized by positions of random samples <- Specify NumPy or CuPy
  • 15. K-means (2/3): Calculate distance to all of the cluster centers ● xp.linalg.norm is to compute the distance and supported both in numpy and cupy ● _fit_calc_distances() uses custom kernel on cupy
  • 16. Customized kernel with C++ snippet in cupy.ElementwiseKernel ● A kernel is generated by element-wise operation defined in C++ snippet
  • 17. K-means (3/3): Update positions of cluster centers ● xp.stack is to update the cluster centers and supported both in numpy and cupy ● _fit_calc_center() is also custom kernel based
  • 18. Another element-wise kernel ● It just adds all of the points inside each cluster and count the number
  • 19. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 20. Performance comparison with NumPy ● CuPy is faster than NumPy even in simple manipulation of large matrix Benchmark code Size CuPy [ms] NumPy [ms] 10^4 0.58 0.03 10^5 0.97 0.20 10^6 1.84 2.00 10^7 12.48 55.55 10^8 84.73 517.17 Benchmark result 6x faster
  • 21. ● Data types (dtypes) ○ bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, and complex128 ● All basic indexing ○ indexing by ints, slices, newaxes, and Ellipsis ● Most of advanced indexing ○ except indexing patterns with boolean masks ● Most of the array creation routines ○ empty, ones_like, diag, etc... ● Most of the array manipulation routines ○ reshape, rollaxis, concatenate, etc... ● All operators with broadcasting ● All universal functions for element-wise operations ○ except those for complex numbers ● Linear algebra functions accelerated by cuBLAS ○ including product: dot, matmul, etc... ○ including decomposition: cholesky, svd, etc... ● Reduction along axes ○ sum, max, argmax, etc... ● Sort operations implemented by Thrust ○ sort, argsort, and lexsort ● Sparse matrix accelerated by cuSPARSE Compatibility with NumPy
  • 22. Comparison with other Python libraries for/on CUDA ● CuPy is the only library that is designed for high compatibility with NumPy still allowing users to write customized CUDA kernels for better performance CuPy PyCUDA MinPy* NVIDIA CUDA support ✔ ✔ ✔ CPU/GPU agnostic coding ✔ ✔ Automatic gradient support ** ✔ NumPy compatible interface ✔ ✔ User-defined CUDA kernel ✔ ✔ * https://github.com/dmlc/minpy ** Autograd is supported by Chainer
  • 23. Inside CuPy ● CuPy extensively relies on NVIDIA libraries for better performance Linear algebra NVIDIA GPU CUDA cuDNN cuBLAS cuRANDcuSPARSE NCCL Thrust Sparse matrix DNN Utility Random numbers cuSOLVER User- defined CUDA kernel Multi- GPU data transfer Sort CuPy
  • 24. Looks very easy? ● CUDA and its libraries are not designed for Python nor NumPy ━ CuPy is not just a wrapper of CUDA libraries for Python ━ CuPy is a fast numerical computation library on GPU with NumPy-compatible API ● NumPy specification is not documented ━ We have carefully investigated some unexpected behaviors of NumPy ━ CuPy tries to replicate NumPy’s behavior as much as possible ● NumPy’s behaviors vary between different versions ━ e.g, NumPy v1.14 changed the output format of __str__ • `[ 0. 1.]` -> `[0. 1.]` (no space)
  • 25. Advanced features of CuPy (1/2) Memory pool GPU Memory profiler Function name Used Bytes Acquired Bytes Occurrence LinearFunction 5.16GB 0.18GB 3900 ReLU 0.99GB 0.46GB 1300 SoftMaxEnropy 7.71MB 5.08MB 1300 Accuracy 0.62MB 0.35MB 700 ● This enables function-wise memory profiling on Chainer ● Avoiding cudaMalloc is a common practice in CUDA programming ● CuPy supports memory pooling using Best-Fit with Coalescing (BFC) algorithm ● It reduces memory usage to 25% on seq2seq model
  • 26. Advanced features of CuPy (2/2) Kernel fusion (experimental) @cp.fuse() def fused_func(x, y, z): return (x * y) + z ● By adding decorator @cp.fuse(), CuPy stores a series of operations ● Then it compiles a single kernel to execute the operations
  • 27. ● What is CuPy ● Example: CPU/GPU agnostic implementation of k-means ● Introduction to CuPy ● Recent updates & conclusion
  • 28. • Start providing pre-built wheel packages of CuPy – cupy-cuda80, cupy-cuda90, and cupy-cuda91 – $ pip install cupy-cuda80 • Memory pool is now the default allocator – Added line memory profiler using memory hook and traceback • CUDA stream is fully supported stream = cupy.cuda.stream.Stream() with stream: y = cupy.linalg.norm(x) stream.synchronize() stream = cupy.cuda.stream.Stream() stream.use() y = cupy.linalg.norm(x) stream.synchronize() What’s new in CuPy v4?
  • 30. • Windows support • AMD GPU support via HIP • More useful fusion function • Add more functions (NumPy, SciPy) • Add more probability distributions • Provide simple CUDA kernel • Support DLPack and TensorComprehension – toDLPack() and fromDLPack() @cupy.fuse() def sample2(x, y): return cupy.sum(x + y, axis=0) * 2 CuPy v5 - planned features
  • 31. Summary: CuPy is a drop-in replacement of NumPy for GPU 1. Highly-compatible with NumPy ━ data types, indexing, broadcasting, operations ━ Users can write CPU/GPU-agnostic code 2. High performance on NVIDIA GPUs ━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL 3. Easy to install ━ $ pip install cupy ━ $ conda install cupy 4. Easy to write custom kernel ━ ElementwiseKernel, ReductionKernel import numpy as np x = np.random.rand(10) W = np.random.rand(10, 5) y = np.dot(x, W) import cupy as cp x = cp.random.rand(10) W = cp.random.rand(10, 5) y = cp.dot(x, W) to GPU to CPU Your contribution will be highly appreciated & We are hiring!