
On the benchmark of Chainer


Published on July 2, 2016: Chainer Meetup #3



  1. On the benchmark of Chainer
     July 2, 2016, Chainer Meetup #3 @ Dwango Seminar Room
     Preferred Networks, Inc. Kenta Oono (oono@preferred.jp)
  2. Self Introduction
     • Kenta Oono (Twitter: @delta2323_)
       – Bio: MSc @ MathSci, Univ. of Tokyo → 2012.4 PFI → 2014.10 PFN
       – Role: BioHealthcare project, Chainer dev. team, etc.
       – Blog: http://delta2323.github.io
     • Recent activity
       – Study meetups (NIPS2014, ICML2015, NIPS2015)
       – Several articles and talks on Deep Learning
       – July 21: ICML2016 reading meetup @ Dwango Seminar Room
  3. What is a Benchmark?
     • Metrics that evaluate the performance of frameworks
       – elapsed time, memory consumption, ease of use, etc.
     • Related to, but different from, profiling
       – Profiling needs finer-grained information about the framework, possibly at the cost of performance
       – Benchmarking measures the overall behavior of the framework
     • For framework developers, a benchmark:
       – suggests directions for further enhancement of the framework
       – provides an objective comparison with other frameworks
     • For framework users, a benchmark:
       – helps them choose the framework that best satisfies their needs
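     To make the "memory consumption" metric concrete, the sketch below reports the device memory held by CuPy's memory pool after one forward pass (an elapsed-time sketch follows the next slide). The toy model, the sizes, and the memory-pool calls are illustrative assumptions; older CuPy versions bundled with Chainer 1.x may not expose these calls.

       import cupy
       import chainer
       import chainer.links as L

       # Toy model and input; sizes are arbitrary for illustration.
       model = L.Linear(1000, 1000).to_gpu()
       x = chainer.Variable(cupy.random.randn(256, 1000).astype('f'))

       pool = cupy.get_default_memory_pool()
       y = model(x)                          # forward pass
       cupy.cuda.Device().synchronize()      # wait for the GPU before reading the counters
       # Bytes currently in use (parameters, input, output, workspace)
       print('device memory in use: %d bytes' % pool.used_bytes())
       # Bytes held by the pool, including cached but unused blocks
       print('device memory held by the pool: %d bytes' % pool.total_bytes())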
  4. Example: convnet-benchmarks
     • Author: Soumith Chintala (Facebook AI Research)
     • Measures latencies of convolutional neural networks
     • Provides an objective comparison across various frameworks
     • Metric
       – Elapsed time of forward and backward propagation
     • Architectures
       – AlexNet-OWT / Overfeat / VGG-A / GoogLeNet
       – Single 2D convolution layers of various sizes
     • Frameworks
       – Torch, neon, TensorFlow, fbfft (Torch), Chainer, cuda-convnet2, Caffe, CL-nn, Caffe-CL GreenTea, etc.
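     The sketch below illustrates the "elapsed time of forward and backward propagation" metric for a single 2D convolution layer in Chainer, in the spirit of convnet-benchmarks. It is not the benchmark's actual harness; the layer shape, batch size, and warm-up scheme are arbitrary choices for this sketch, and a GPU with CuPy is assumed.

       import time
       import cupy
       import chainer
       import chainer.functions as F
       import chainer.links as L

       # Single 2D convolution layer; the shape roughly mimics an AlexNet-style
       # first layer, but the exact sizes are arbitrary.
       conv = L.Convolution2D(3, 96, 11, stride=4).to_gpu()
       x = chainer.Variable(cupy.random.randn(128, 3, 224, 224).astype('f'))

       def forward_backward():
           y = conv(x)
           loss = F.sum(y)
           loss.backward()

       forward_backward()                    # warm-up (allocation, cuDNN setup)
       cupy.cuda.Device().synchronize()

       t0 = time.perf_counter()
       forward_backward()
       cupy.cuda.Device().synchronize()      # wait for all kernels before stopping the clock
       print('forward+backward: %.3f ms' % ((time.perf_counter() - t0) * 1e3))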
  5. convnet-benchmarks
  6. Basics of measuring kernel execution
     • We cannot measure GPU execution time the same way as CPU time, because kernel launches are asynchronous!
       clock_t start, end;
       start = clock();
       // launch kernel
       end = clock();
       elapsed_time = end - start;
     [Timeline diagram: the CPU reads the clock, kicks the kernel, and reads the clock again while the kernel is still executing on the GPU.]
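     The same pitfall can be demonstrated from Python, assuming CuPy and a GPU: the host-side clock stops almost immediately because the call only queues the kernel, and the true execution time only appears after an explicit synchronization. The matrix product and sizes are arbitrary examples.

       import time
       import cupy

       a = cupy.random.randn(4096, 4096).astype('f')
       b = cupy.random.randn(4096, 4096).astype('f')

       cupy.dot(a, b)
       cupy.cuda.Device().synchronize()      # warm-up, so only launch cost is measured below

       t0 = time.perf_counter()
       c = cupy.dot(a, b)                    # the kernel is only queued here
       t1 = time.perf_counter()
       cupy.cuda.Device().synchronize()      # block until the kernel actually finishes
       t2 = time.perf_counter()

       print('host clock around the call: %.3f ms' % ((t1 - t0) * 1e3))
       print('after device synchronize  : %.3f ms' % ((t2 - t0) * 1e3))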
  7. Basics of measuring kernel execution
     • We can measure the kernel execution time by recording two events, one at the start and one at the end of the launch.
       float elapsed = 0;
       cudaEvent_t start, stop;
       cudaEventCreate(&start);
       cudaEventCreate(&stop);
       cudaEventRecord(start, 0);
       // launch the kernel
       cudaEventRecord(stop, 0);
       cudaEventSynchronize(stop);
       cudaEventElapsedTime(&elapsed, start, stop);
       cudaEventDestroy(start);
       cudaEventDestroy(stop);
     [Timeline diagram: the CPU records the start event, kicks the kernel, records the stop event, and synchronizes on it; the GPU processes the two events around the kernel execution.]
  8. Measuring a single Chainer function execution
     • Suppose the GPU implementation of F.f consists of a Python part and a single GPU kernel.
     • In this case, the elapsed time corresponds to the kernel execution.
       start = cupy.cuda.Event()
       end = cupy.cuda.Event()
       start.record()
       y = F.f(x)  # forward prop
       end.record()
       end.synchronize()
       cupy.cuda.get_elapsed_time(start, end)
     [Timeline diagram: the CPU runs the Python part of F.f between the two event records; on the GPU, the interval between the two events covers the kernel execution of F.f.]
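     A self-contained version of the snippet above, assuming CuPy and a GPU; F.relu and the input shape stand in for the abstract F.f and are not from the original slide.

       import cupy
       import chainer
       import chainer.functions as F

       x = chainer.Variable(cupy.random.randn(128, 1000).astype('f'))

       start = cupy.cuda.Event()
       end = cupy.cuda.Event()
       start.record()
       y = F.relu(x)  # forward prop of a single function
       end.record()
       end.synchronize()
       print('%.3f ms' % cupy.cuda.get_elapsed_time(start, end))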
  9. Measuring a single Chainer function execution
     • Suppose that
       – no other kernels are waiting in the queue,
       – the Python overhead is large, and
       – the kernel is light.
     • Then get_elapsed_time equals the whole execution time, including the Python code.
       start = cupy.cuda.Event()
       end = cupy.cuda.Event()
       start.record()
       y = F.f(x)  # forward prop
       end.record()
       end.synchronize()
       cupy.cuda.get_elapsed_time(start, end)
     [Timeline diagram: with an empty queue, the start event is processed immediately, so the interval between the two events spans the Python part of F.f as well as the kernel execution.]
  10. Measuring a single Chainer function execution
     • In general, the elapsed time between the two events differs from what we measured in the two previous situations.
     • What we really measure depends on
       – the status of the waiting queue, and
       – the relative amounts of Python code and kernel execution.
       start = cupy.cuda.Event()
       end = cupy.cuda.Event()
       start.record()
       y = F.f(x)  # forward prop
       end.record()
       end.synchronize()
       cupy.cuda.get_elapsed_time(start, end)
     [Timeline diagram: previously queued kernels delay the start event, so the measured interval covers an unpredictable mix of queued kernels, Python code, and the F.f kernel.]
  11. Synchronization before the start Event
     • It ensures that the start Event is processed right before the execution of the Python code.
     • But the timing of the end Event is still undetermined.
       start = cupy.cuda.Event()
       end = cupy.cuda.Event()
       start.record()
       start.synchronize()
       y = F.f(x)  # forward prop
       end.record()
       end.synchronize()
       cupy.cuda.get_elapsed_time(start, end)
     [Timeline diagram: the CPU blocks on the start event until previously queued kernels finish, then runs the Python part of F.f; the end event may still be delayed behind the kernels that F.f kicked.]
  12. Measuring multi-layered NNs
     • Should we insert synchronization points before every function execution? (A sketch of such per-function measurement follows below.)
     • But doing so exposes Python code that would have been hidden by kernel execution if it were not for the synchronization.
     • I guess this is the reason why convnet-benchmarks offers architectures that consist of a single convolution layer.
     [Timeline diagram: per-function event records and synchronizations interleave with the Python code; the Python time between kernels would have been hidden by the kernel executions if we did not measure the lap times.]
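     A minimal sketch of the per-function measurement discussed above, assuming CuPy, a GPU, and a toy two-function chain; the helper name timed_call and the functions used are illustrative only.

       import cupy
       import chainer
       import chainer.functions as F

       def timed_call(f, *args):
           # Synchronize before the start event, then time one function call.
           start = cupy.cuda.Event()
           end = cupy.cuda.Event()
           start.record()
           start.synchronize()      # exposes the Python part of f (see slide 12)
           y = f(*args)
           end.record()
           end.synchronize()
           return y, cupy.cuda.get_elapsed_time(start, end)

       x = chainer.Variable(cupy.random.randn(128, 1000).astype('f'))
       h, t1 = timed_call(F.relu, x)
       y, t2 = timed_call(F.tanh, h)
       print('relu: %.3f ms, tanh: %.3f ms' % (t1, t2))

     Because of the start.synchronize() call, t1 and t2 include the Python overhead of each function, which is exactly the distortion this slide warns about.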
  13. Tentative solution (Timer class: PR #1249)
     • Offers start and stop methods for measuring lap times.
     • Three synchronization patterns before measurement, chosen by the blocking_method argument:
       – block_every_time: synchronizes at every start event
       – block_first_time: synchronizes only at the first start event
       – non_block: does not synchronize at the start of measurement
     • When we get the total time, the Timer class implicitly calls its synchronize method.
     • The synchronize method synchronizes all Events inserted by start and stop and computes the lap times lazily.
     • Once synchronize is invoked, the timer CANNOT accumulate lap times until it is reset.
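     The following is a minimal, self-contained sketch in the spirit of the Timer class described above. It is reconstructed from the bullet points on this slide, not taken from PR #1249, so the class name, method details, and behavior may differ from the actual implementation.

       import cupy

       class EventTimer(object):
           """Accumulates lap times between start()/stop() pairs using CUDA events."""

           def __init__(self, blocking_method='non_block'):
               assert blocking_method in (
                   'block_every_time', 'block_first_time', 'non_block')
               self.blocking_method = blocking_method
               self.reset()

           def reset(self):
               self._events = []          # list of (start event, stop event) pairs
               self._laps = None          # computed lazily in synchronize()
               self._started_once = False
               self._current_start = None

           def start(self):
               # As on the slide: after synchronize(), reset() is needed first.
               assert self._laps is None, 'call reset() after synchronize()'
               ev = cupy.cuda.Event()
               ev.record()
               if (self.blocking_method == 'block_every_time'
                       or (self.blocking_method == 'block_first_time'
                           and not self._started_once)):
                   ev.synchronize()       # wait for previously queued kernels
               self._started_once = True
               self._current_start = ev

           def stop(self):
               ev = cupy.cuda.Event()
               ev.record()
               self._events.append((self._current_start, ev))

           def synchronize(self):
               # Synchronize all stop events and compute the lap times lazily.
               for _, stop_ev in self._events:
                   stop_ev.synchronize()
               self._laps = [cupy.cuda.get_elapsed_time(s, e)
                             for s, e in self._events]

           def total_time(self):
               # Implicitly synchronizes, like the Timer class described above.
               if self._laps is None:
                   self.synchronize()
               return sum(self._laps)     # milliseconds

     A caller would wrap each region of interest in start()/stop(), for example one pair per forward pass, and read total_time() once at the end, so that synchronization is paid only when the lap times are actually needed.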
  14. DeepMark, authored by Soumith Chintala
     • Comparison with convnet-benchmarks
       – Not only image recognition but also various other use cases
       – Relatively newer architectures are employed
       – Multi-GPU evaluation will be supported (planned)
     • Many details of the specification are still under discussion.
     • Architectures (planned)
       – Images: InceptionV3-batchnorm / AlexNet-OWT / VGG-D / ResNet-50
       – Video: C3D
       – Audio: DeepSpeech2 / MSR's 5-layer FC
       – Text: Small RNN LSTM / Large RNN LSTM
     • Chainer support (delta2323/chainer-deepmark)
       – Not all features are supported yet (see the issues for details)
  15. Conclusion
     • Measuring the elapsed time of multi-layered NNs involves many subtleties.
     • We will participate in DeepMark, a general-purpose deep learning benchmark.
     • Many criteria remain to be measured:
       – Elapsed time <- today's topic
       – Memory consumption
       – etc.
     We are hiring!
