
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT



For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/luxoft/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit

For more information about embedded vision, please visit:
http://www.embedded-vision.com

Alexey Rybakov, Senior Director at LUXOFT, presents the "Making Computer Vision Software Run Fast on Your Embedded Platform" tutorial at the May 2016 Embedded Vision Summit.

Many computer vision algorithms perform well on desktop-class systems but struggle on resource-constrained embedded platforms. This how-to talk surveys optimization methods that make vision software run fast on the low-power, small-footprint hardware widely used in automotive, surveillance, and mobile devices. The presentation explores practical aspects of deep algorithm and software optimization, such as thinning of input data, using dynamic regions of interest, mastering data pipelines and memory access, and overcoming compiler inefficiencies.



"Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT

  1. Making Computer Vision Software Run Fast on Your Embedded Platform: Art and Science of Optimization. Alexey Rybakov, LUXOFT, May 3, 2016. Copyright © 2016 LUXOFT.
  2. Why LUXOFT is Giving This Talk
     • Global software engineering: low-power GPU software, custom vision software
     • 10,000+ Luxoft software engineers
  3. Our Optimization Projects Covered in This Talk
     • Obstruction Removal for Drones
     • CAFFE on ARM Mali
     • OpenCV on ImgTec PowerVR
     • HDR Encoding on GPU-based
     • Low-power Motion Stabilization
     • GPU-optimized 4K VP9 video codec
     • See demos at our booth
     Demo labels: Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs
  4. Poll
     • Qualifying question: Who develops computer vision software?
     • Typical situations in embedded SW development:
       • Great new algorithm → Implement
       • Implementation platform: Desktop-class → Embedded*
       • Decision making: Delayed → Real-time*
       • Performance: Low FPS → High FPS*
     * Context of this presentation
  5. Embedded Vision: Challenges and Opportunities
     • Need: reliable, real-time, on-device decision-making from visual data, implemented on a constrained HW platform (with exotic architecture)
     • What to do:
       1. Map CV pipeline onto HW platform
       2. Rethink system requirements
       3. Rework algorithm logic
       4. Use GPU, DSP and other aid (properly!)
       5. Code optimization
       6. Know your platform inside out
  6. Step 1: Map CV Pipeline onto HW Platform
  7. Embedded Vision: Pipeline and Hardware
  8. Study and Test HW Platform
     Evaluate your platform:
     • Hardware features and accelerators, slow/fast memory, power management?
     • Support from run-time: OS, drivers, OpenCL, CUDA, other frameworks?
     • Toolchain: compiler, debugger, profiler, [access to] documentation, optimization guides?
     • Available CV frameworks: OpenCV, IPP, FastCV, other?
     Benchmark your embedded platform vs. a reference:
     • Run simple tests: data copy, access, vectorization, memory use, energy management
     • Test whether CV-framework functions are optimized (coverage is often low)
     This gives you a measured optimization goal.
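The "run simple tests: data copy" advice can be illustrated with a tiny micro-benchmark. This is a minimal sketch, not the presenter's tooling: the function name `measure_copy_bandwidth` and its parameters are hypothetical, and a Python-level copy only approximates what a native memcpy test on the target would report.

```python
import time


def measure_copy_bandwidth(num_bytes=8 * 1024 * 1024, repeats=5):
    """Return the best observed buffer-copy rate in bytes per second."""
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read of src plus one full write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == num_bytes
    # Guard against a zero reading on a coarse timer.
    return num_bytes / max(best, 1e-9)


rate = measure_copy_bandwidth()
print(f"copy rate: {rate / 1e9:.2f} GB/s")
```

Taking the best of several repeats reduces noise from scheduling and cold caches. Running the same test on the desktop reference and on the embedded target gives the kind of measured baseline the slide asks for.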
  9. Mapping to Platform: Histogram Example (Option A vs. Option B)
     Pipeline in both options: Camera → Histogram → Histogram equalization → Apply LUT, with each stage marked as GPU processing or CPU processing.
     • Diagram timings: Histogram* 2 ms; Histogram 4.2 ms; 16.2 ms
     • Diagram transfers: one 2 MB data transfer (HD frame); three 1 KB data transfers
     • Memory transfers: HOST → GPU = 1.33 GB/s; GPU → HOST = 0.11 GB/s
     • SoC: Intel Merrifield platform. Device: Dell Venue 3840
     * Histogram collection on CPU is more than 2 times faster than on GPU.
     ** Histogram equalization is single-threaded, iterative histogram processing, so a GPU implementation is not reasonable.
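The cost of moving data between host and GPU can be estimated from the transfer rates quoted on this slide. The sketch below is an idealized calculation under stated assumptions: 1 MB = 10^6 bytes, 1 GB = 10^9 bytes, no fixed per-transfer latency, and the two quoted rates read as the two transfer directions. The helper `transfer_time_ms` is hypothetical. The slide's own timings are measurements and need not match these estimates.

```python
def transfer_time_ms(num_bytes, gigabytes_per_second):
    """Idealized transfer time in milliseconds: size divided by bandwidth."""
    return num_bytes / (gigabytes_per_second * 1e9) * 1e3


FRAME_BYTES = 2_000_000   # the "2 MB (HD frame)" transfer from the slide
HISTOGRAM_BYTES = 1_000   # the "1 KB" transfers from the slide

for label, size in (("2 MB frame", FRAME_BYTES), ("1 KB histogram/LUT", HISTOGRAM_BYTES)):
    fast = transfer_time_ms(size, 1.33)  # rate quoted as 1.33 GB/s
    slow = transfer_time_ms(size, 0.11)  # rate quoted as 0.11 GB/s
    print(f"{label}: {fast:.4f} ms at 1.33 GB/s, {slow:.4f} ms at 0.11 GB/s")
```

Under these assumptions, moving a whole frame over the slow direction costs on the order of a full 60 FPS frame budget, while moving a 1 KB histogram or LUT is negligible. That difference is the reason to choose the stage split by what has to cross the bus.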
  10. Step 2: Rethink System Requirements!
  11. Optimize System Requirements
      • Important concept: “Good enough”
      • How does your use case differ from classic/desktop requirements? The art of “controlled worse”:
        • What decision latency do you need?
        • What resolution/precision?
        • Do you need the whole frame or only a region?
  12. Rethink Requirements: Obstruction Removal, Drone Edition
      Universal implementation* → Our drone implementation:
      • Any motion → Linear motion
      • Any obstacles → Opaque obstacles
      • Have only image data → Use sensor fusion (gyro)
      • More than 100X faster!
      Figure labels: Camera, Output
      * MIT CSAIL and Google Research, SIGGRAPH 2015
  13. Step 3: Rework Algorithm Logic
  14. Algorithm Optimization Opportunities (Desktop → Embedded)
      • High-res → Downsampling / pyramid
      • Color → Monochrome or luminance
      • Entire frame → Regions of interest only
        • ROI cascading example: HOG to DNN
      • Every frame → 1/N + approximation
        • Inter-frame cascading: detection to tracking
      • Image only → Sensor fusion
        • Example: gyro + vision for motion estimation
      • CPU → Parallelize for GPU
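Three of the data-thinning steps listed above (color to luminance, downsampling, region of interest) can be sketched in a few lines. This is an illustrative pure-Python version, not code from the talk: the function names are hypothetical, the luma weights are the common BT.601 integer approximation, and a real implementation would run on the ISP, GPU, or SIMD units.

```python
def to_luma(rgb_frame):
    """Convert rows of (r, g, b) tuples to 8-bit luma; weights 77+150+29 = 256."""
    return [[(77 * r + 150 * g + 29 * b) >> 8 for (r, g, b) in row]
            for row in rgb_frame]


def downsample_2x(gray):
    """Average each 2x2 block with rounding; an odd trailing row/column is dropped."""
    h = len(gray) // 2 * 2
    w = len(gray[0]) // 2 * 2
    return [[(gray[y][x] + gray[y][x + 1] + gray[y + 1][x] + gray[y + 1][x + 1] + 2) >> 2
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def crop_roi(gray, x0, y0, width, height):
    """Keep only the region of interest."""
    return [row[x0:x0 + width] for row in gray[y0:y0 + height]]


frame = [[(255, 255, 255)] * 8 for _ in range(8)]  # 8x8 white RGB frame
small = downsample_2x(to_luma(frame))              # 4x4 single-channel image
roi = crop_roi(small, 1, 1, 2, 2)                  # 2x2 region
print(len(frame) * len(frame[0]) * 3, "->", len(roi) * len(roi[0]), "values")
```

Each step cuts the amount of data the later, more expensive stages have to touch. Calling `downsample_2x` repeatedly produces the levels of an image pyramid.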
  15. Optimized Video Stabilization Algorithm
      • Motion Vector Field only for a 3x3 grid (pyramid downsampling)
      • Only shift and rotation
      • Inter-frame border reconstruction
      • → 1000x+ performance
      • → Real-time 4K UHD on mobile
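Restricting the motion model to shift and rotation over a 3x3 grid means only three numbers have to be estimated per frame. The sketch below shows one way to do that, assuming a small rotation angle so the model can be linearized. It is not the presenter's algorithm; `fit_shift_and_rotation` and its arguments are hypothetical names.

```python
def fit_shift_and_rotation(points, vectors):
    """Least-squares fit of d_i = t + theta * perp(p_i - c) for small theta.

    points:  grid positions (x, y) where motion was measured
    vectors: measured motion (dx, dy) at each position
    Returns (tx, ty, theta), with theta in radians about the grid centroid c.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # With positions centered on c, the best shift is the mean motion vector.
    tx = sum(v[0] for v in vectors) / n
    ty = sum(v[1] for v in vectors) / n
    num = 0.0
    den = 0.0
    for (px, py), (vx, vy) in zip(points, vectors):
        rx, ry = px - cx, py - cy
        # perp(r) = (-ry, rx) is the direction a small rotation moves point r.
        num += -ry * (vx - tx) + rx * (vy - ty)
        den += rx * rx + ry * ry
    return tx, ty, num / den


grid = [(x, y) for y in (0.0, 100.0, 200.0) for x in (0.0, 100.0, 200.0)]
# Synthetic motion: shift (3, -2) plus a 0.01 rad rotation about the grid center.
motion = [(3.0 - 0.01 * (y - 100.0), -2.0 + 0.01 * (x - 100.0)) for (x, y) in grid]
print(fit_shift_and_rotation(grid, motion))
```

Nine vectors and a closed-form fit replace a dense per-pixel motion field, which is the kind of model reduction the slide credits for the large speedup.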
  16. Step 4: Use GPU and Other Aid (Properly)
  17. Use GPU. Properly
      • Good news: computer vision is very parallelizable
      • Bad news: coordination between CPU and GPU (and other compute devices) is the tricky part
      • GPU: what to do (beyond algorithm-to-platform mapping and reworked logic)
        • A few simple rules: memory types, datatypes, workgroup size, memory alignment
        • Master the art of kernel synchronization: load your cores
        • Use GPU pre-optimized libraries, like OpenCV on some platforms
        • Master OpenCL
      • Also explore available ISP or DSP benefits
  18. GPU, Two Key Concepts
      1. Memory hierarchy
      2. Task synchronization
      • Example of both: large matrix transpose
  19. ARM Mali Accelerated CAFFE
      Original (all FPS measured on Galaxy S7):
      • Run existing DNN framework: CAFFE
      • = 0.7 FPS (EIGEN OpenCL library)
      CPU optimization (not a through road):
      • Optimized version for Android: DNN-optimized OpenBLAS (OpenMP and NEON) → +2 FPS
      GPU optimizations:
      • Better OpenCL implementation on ViennaCL library → +0.5 FPS
      • Found bottleneck: SGEMM functions
      • → Rewrite SGEMM (workgroup size, vectorization, etc.) → +4.5 FPS
      Final optimized performance: 5-6 FPS
      Chart labels, in slide order: Open Source CPU, 1 thread; Open Source GPU OpenCL (ViennaCL); Open Source CPU multithreaded, NEON; LUXOFT
      Chart values, in slide order: 0.7 FPS, 1.2 FPS, 2.5 FPS, 5.4 FPS
  20. ARM Mali Accelerated CAFFE: Benchmarks
      [Chart] Legend: colors show FPS, CPU load, and battery charge; lines show CPU and optimized GPU.
  21. VP9 Video Decoder Optimization for GPU
      Decoder stages (input frame → output frame): Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop Filtering. The diagram marks each stage as GPU processing or CPU processing.
      • Original CPU algorithm: superblock-level parallelism
      • Reworked and optimized GPU algorithm: frame-level parallelism; uses more memory
      Optimization result: 2x-5x FPS depending on bitrate. Platforms: AMD, Intel, NVidia SoCs
  22. Step 5: Code Optimization
  23. Two Enemies: Code and Data
      • Two enemies: 1. Computation 2. Data transfers
      • Waste of time = waste of energy
      Don’t calculate:
      - Use table/lookup functions
      - Use polynomial approximations
      Use classic techniques:
      - Like loop unrolling
      - Converting to native data types
      Don’t move data:
      - Use local and cache memory
      - Partition/group DRAM access
      Benchmark everything:
      - Compiler computation options
      - Memory transfers
      Controversial example → the ARM compiler does it automatically; some others don’t.
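"Don't calculate, use table/lookup functions" is easiest to see with a per-pixel transfer curve. The sketch below uses gamma correction purely as an illustration (the slide does not name a specific function), and `build_gamma_lut` and `apply_lut` are hypothetical helpers. The expensive power function runs 256 times to fill the table instead of once per pixel.

```python
def build_gamma_lut(gamma):
    """Precompute the 8-bit gamma curve once: 256 pow() calls in total."""
    lut = [round(255.0 * (i / 255.0) ** gamma) for i in range(256)]
    return bytes(min(255, max(0, v)) for v in lut)


def apply_lut(pixels, lut):
    """Map every 8-bit pixel through the table; no arithmetic per pixel."""
    return pixels.translate(lut)


lut = build_gamma_lut(2.2)
frame = bytes(range(256)) * 4          # stand-in for 8-bit image data
corrected = apply_lut(frame, lut)
print(corrected[0], corrected[128], corrected[255])
```

`bytes.translate` does the table lookup in native code. On a GPU or DSP the same idea maps to a small table held in local or constant memory. Whether the table beats direct computation on a given platform should be benchmarked, as the slide says.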
  24. OpenCV on ImgTec PowerVR GPU: Histogram Example
      OpenCV local contrast for HD camera adjustment in real time
      • Existing OpenCV histogram implementations don’t fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS)
      Optimization results (histogram gathering method: time):
      • OpenCV histogram (CPU): 7.5 ms
      • OpenCV histogram (GPU): 4.4 ms
      • Luxoft-PowerVR (atomic_add to global memory): 0.69 ms
      • Luxoft-PowerVR (atomic_add to local memory): 7.51 ms
      • Luxoft-PowerVR (increment at local memory): 3.28 ms
      → Things to do:
      • Experiment
      • Benchmark
      • Choose the best method
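The histogram variants compared on this slide differ mainly in where the bin counters live. As a CPU-side illustration only, the sketch below contrasts a single shared histogram with per-group partial histograms that are merged at the end, which is the structure behind the "local memory" variants. The function names and `group_size` are hypothetical, and Python says nothing about which variant is faster on a given GPU.

```python
def histogram_direct(pixels):
    """One shared 256-bin histogram updated for every pixel."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    return hist


def histogram_partial(pixels, group_size=4096):
    """Per-group local histograms merged into the result."""
    merged = [0] * 256
    for start in range(0, len(pixels), group_size):
        local = [0] * 256  # stand-in for one workgroup's local-memory histogram
        for p in pixels[start:start + group_size]:
            local[p] += 1
        for b in range(256):
            merged[b] += local[b]
    return merged


def frame_budget_ms(fps):
    """Time available per frame for the entire processing chain."""
    return 1000.0 / fps


pixels = bytes((i * 7) % 256 for i in range(100_000))
assert histogram_direct(pixels) == histogram_partial(pixels)
print(f"budget at 60 FPS: {frame_budget_ms(60):.1f} ms")
```

Both functions produce the same counts, so the choice between them is purely a performance question. The slide's measurements show the answer can be counterintuitive, hence its advice to experiment and benchmark.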
  25. Example: Fighting Data Transfers
      • Example: “memory tiling”. A tiled memory layout may give a 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose
      • Reference you need to obtain or produce (will vary by the CPU/GPU of your choice)
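The tiling idea for matrix transpose can be shown as a loop structure. This is a minimal sketch with hypothetical names (`transpose_tiled`, `tile`). Python will not reproduce the cache or DRAM behavior, so it only illustrates the access pattern: each small tile is read and written completely before the next one is touched, so both the source rows and destination rows in flight stay within fast memory.

```python
def transpose_naive(src, rows, cols):
    """Row-major transpose: reads are sequential, writes stride by `rows`."""
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            dst[c * rows + r] = src[r * cols + c]
    return dst


def transpose_tiled(src, rows, cols, tile=4):
    """Same result, processed one tile x tile block at a time."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # All accesses inside this block touch at most `tile` source rows
            # and `tile` destination rows.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst


matrix = list(range(6 * 10))  # 6 rows x 10 columns, row-major
assert transpose_tiled(matrix, 6, 10) == transpose_naive(matrix, 6, 10)
```

On real hardware the tile size would be chosen from the cache line, local memory, or DRAM burst size given in the platform's optimization guide, then confirmed by benchmarking.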
  26. Step 6: Know Your Platform Inside Out
  27. Platform Specifics
      • Things to do:
        • Study documentation and optimization guides for your exact HW
        • Again, test/benchmark a feature before you critically rely on it
      • What works for you:
        • Modern GPUs and DSPs may implement the entire algorithm in 1 instruction
      • What works against you:
        • Don’t assume everything will work as documented
        • “Fast” memory may be slow (like early versions of Snapdragon)
        • Great technology, but no documentation and no code examples (like iOS Metal for compute)
  28. Platform Example: AMD GPU for Frame Interpolation
      • Motion vector field (MVF) upsampling is a common task in CV
      • OpenCL supports bilinear interpolation of everything
      • How to do it with the AMD OpenCL implementation:
        • AMD has the QSAD function, the fastest way to compute SAD for blocks
        • Keep the MVF in an Image2D
        • Use a sampler with CLK_FILTER_LINEAR (bilinear filtering for 2D images)
      Figure labels: Basic, Optimized
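For reference, this is what bilinear upsampling of a motion vector field computes when done by hand on the CPU; an image sampler with linear filtering performs the equivalent interpolation in hardware. The sketch makes several assumptions: the names `sample_bilinear` and `upsample_mvf` are hypothetical, coordinates are clamped at the edges, output samples are aligned by pixel centers, and the vectors are already expressed in full-resolution pixel units, so they are interpolated but not rescaled.

```python
def sample_bilinear(field, x, y):
    """Bilinearly interpolate a field of (dx, dy) vectors at fractional (x, y)."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the field edges
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    out = []
    for k in range(2):              # interpolate dx and dy independently
        top = field[y0][x0][k] + (field[y0][x1][k] - field[y0][x0][k]) * fx
        bot = field[y1][x0][k] + (field[y1][x1][k] - field[y1][x0][k]) * fx
        out.append(top + (bot - top) * fy)
    return tuple(out)


def upsample_mvf(field, factor):
    """Upsample the field by an integer factor using pixel-center alignment."""
    h, w = len(field), len(field[0])
    return [[sample_bilinear(field,
                             (ox + 0.5) / factor - 0.5,
                             (oy + 0.5) / factor - 0.5)
             for ox in range(w * factor)]
            for oy in range(h * factor)]


coarse = [[(0.0, 0.0), (2.0, 4.0)],
          [(0.0, 0.0), (2.0, 4.0)]]
dense = upsample_mvf(coarse, 2)
print(len(dense), "x", len(dense[0]))
```

The hand-written version spends four reads and several multiplies per output vector. Moving that work into the texture sampler is the platform-specific shortcut the slide describes.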
  29. Platform Example: Apple iOS Metal for GPU Compute
      iOS Metal Compute findings (encountered while working on GPU-optimized JPEG-HDR encoding on iPhone):
      • No code examples for compute, weak documentation = black box
      • Only 64 GitHub repos, no serious projects
      • Xcode profiler does not work with Metal Compute → use a workaround: manual timer-based profiling
      • Vector types are not fully supported by the compiler → test everything, then use a workaround: a combined approach with scalars and vectors
      We still achieved about 3x-4x faster JPEG encode on iPhone; it just took a lot of extra work.
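When a vendor profiler is unavailable, manual timer-based profiling usually means wrapping each pipeline stage in a timer and accumulating the totals. The sketch below shows the pattern in Python as an illustration only; `SectionTimer` is a hypothetical helper, not the presenter's code. For GPU work the timer has to stop only after the submitted work has actually completed, otherwise it measures submission time rather than execution time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so partial runs still report.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return {stage: (calls, mean seconds per call)}."""
        return {name: (self.counts[name], self.totals[name] / self.counts[name])
                for name in self.totals}


timer = SectionTimer()
for _ in range(3):
    with timer.section("histogram"):
        sum(range(10_000))          # stand-in for a real pipeline stage
    with timer.section("apply_lut"):
        bytes(range(256)).translate(bytes(range(256)))

for stage, (calls, mean_s) in timer.report().items():
    print(f"{stage}: {calls} calls, {mean_s * 1e3:.3f} ms mean")
```

Averaging over many frames smooths out scheduling noise, and per-stage totals show where the frame budget is being spent.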
  30. Lessons Learned and Resources
  31. Lessons Learned
      1. Learn, test, profile, and benchmark every component of your system, including the compiler. Don’t assume.
      2. Don’t port 1:1. Rework requirements and algorithm logic too.
      3. GPU and other non-CPU compute architectures may give fantastic results.
      4. Use parallelization and computer vision frameworks like OpenCL or OpenCV. Rewrite critical parts there as needed.
      5. Modern HW platforms implement popular algorithms in one function call. Study platform-specific optimization guides.
      6. Sometimes things won’t work as documented. This is normal.
      7. Optimization is a mix of art and science. Think outside the box.
  32. Resources
      • Embedded Vision Alliance: http://www.embedded-vision.com/
      • Platform optimization guides and blog posts from Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia, Qualcomm, TI
      • Luxoft Computer Vision team: vision@luxoft.com
  33. Thank you! LUXOFT Presentation R&D Team: Aleksandr Bobrovnik, Aleksandr Volkov, Alexey Rybakov, Anton Veselov, Artem Galin, Dmitriy Marenkov, Dmitry Ivanov, Ekaterina Popova, Ihor Starepravo, Ildar Valiev, Marat Gilmutdinov, Nikolay Nemcev, Oleksandr Murovanyi, Sergey Fedorov, Valery Bobrov, Viktor Pasoshnikov
  34. See demos at our booth. And email me too: Alexey Rybakov, Senior Director, Embedded, LUXOFT, Menlo Park, CA. ARybakov@luxoft.com
