Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology,” a Presentation from Flex Logix

Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio
Anuncio

Eche un vistazo a continuación

1 de 18 Anuncio

“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology,” a Presentation from Flex Logix

Descargar para leer sin conexión

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/high-efficiency-edge-vision-processing-based-on-dynamically-reconfigurable-tpu-technology-a-presentation-from-flex-logix/

Cheng Wang, Senior Vice President and Co-founder of Flex Logix, presents the “High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology” tutorial at the May 2022 Embedded Vision Summit.

To achieve high accuracy, edge computer vision requires teraops of processing to be executed in fractions of a second. Additionally, edge systems are constrained in terms of power and cost. This talk presents and demonstrates the novel dynamic TPU array architecture of Flex Logix’s InferX X1 accelerators and contrasts it to current GPU, TPU and other approaches to delivering the teraops computing required by edge vision inferencing.

Wang compares latency, throughput, memory utilization, power dissipation and overall solution cost. He also shows how existing trained models can be easily ported to run on the InferX X1 accelerator.

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/high-efficiency-edge-vision-processing-based-on-dynamically-reconfigurable-tpu-technology-a-presentation-from-flex-logix/

Cheng Wang, Senior Vice President and Co-founder of Flex Logix, presents the “High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology” tutorial at the May 2022 Embedded Vision Summit.

To achieve high accuracy, edge computer vision requires teraops of processing to be executed in fractions of a second. Additionally, edge systems are constrained in terms of power and cost. This talk presents and demonstrates the novel dynamic TPU array architecture of Flex Logix’s InferX X1 accelerators and contrasts it to current GPU, TPU and other approaches to delivering the teraops computing required by edge vision inferencing.

Wang compares latency, throughput, memory utilization, power dissipation and overall solution cost. He also shows how existing trained models can be easily ported to run on the InferX X1 accelerator.

Anuncio
Anuncio

Más Contenido Relacionado

Similares a “High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology,” a Presentation from Flex Logix (20)

Más de Edge AI and Vision Alliance (20)

Anuncio

Más reciente (20)

“High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology,” a Presentation from Flex Logix

  1. 1. 1 High-Efficiency Edge Vision Processing Based on Dynamically Reconfigurable TPU Technology Cheng C. Wang, Co-Founder and CTO Flex Logix Technologies
  2. 2. © 2022 Flex Logix Technologies Flex Logix: A technology leader 2 Cheng Wang, CTO, Co-founder • Industry expert with track record in tech innovation • Winner: ISSCC Outstanding Paper Award, the premier chip design award. (Recent winners include IBM, Toshiba, Nvidia and Sandisk) Flex Logix: • Founded in 2014 • Profitable embedded FPGA IP business • Edge AI Inference accelerator based on tensor array + programmable logic • logic • Backed by top technology and innovation investors • Growing rapidly – We’re hiring! • Based in Mountain View, CA, with offices in Austin, TX Geoff Tate, CEO • Experienced executive taking company public • Rambus: 4 people to IPO to $2B
  3. 3. © 2022 Flex Logix Technologies Short history of vision processing In 1966, American computer scientist and co-founder of the MIT AI Lab Marvin Minsky hired a first-year undergraduate student, Gerald Sussman, to spend the summer linking a camera to a computer and getting the computer to describe what it saw. Needless to say, Sussman didn’t make the deadline. Vision turned out to be one of the most difficult and frustrating challenges in AI over the next four decades. 3
  4. 4. © 2022 Flex Logix Technologies • Operator Types: 11x11, 5x5, 3x3, MaxPool 3x3s2, FC • Total Layers: 8 • Output is classification to 1000 classes • Operations per Inference: 724 Million Operations per Inference: 724 Million AlexNet 2012 (ten years ago) ImageNet competition winner 4
  5. 5. © 2022 Flex Logix Technologies • General Purpose Processing? • Operating Systems? • Rugged/Industrial Computers? • Network Connectivity? • Software Paradigm? • Imaging Sensor? What are the tough new problems? All Solved Problems 5
  6. 6. © 2022 Flex Logix Technologies • Software Complexity • How do you program a trillion operations? • TeraOp Processing Efficiency • How do you fit TeraOps in a factory? What are the tough new problems? 6
  7. 7. © 2022 Flex Logix Technologies Miniaturizing AI vision systems Providing TeraOps of vision processing into smaller form-factors is a real challenge Industrial Vision Computers Compact Vision Box High Power – 250 W Card System Power – 750 W Low Power – 8 W Card System Power – 30 W $$$$ $ 7
  8. 8. © 2022 Flex Logix Technologies • Good for inference and training • Tons of memory bandwidth • Numerous precision choices • Good SW ecosystem to get started • But .. • Large, expensive, power hungry • Further amplified at the system level • Easy to get started, but hard to optimize for • Support difficulties • Supply-chain difficulties GPUs offer flexibility but at a price Flexible but… • Large • Expensive • Power hungry 8
  9. 9. © 2022 Flex Logix Technologies It’s all about the memory GPUs are architected for DDR, not local memories • Computation is designed to access GDDR memory • NVLink used for further expansion • Only 4 + 12 MB of L2 + RF Highly parallelized version of Von Neumann architecture • Still inefficient, but highly flexible Turing TU104 Full Chip Diagram 9
  10. 10. © 2022 Flex Logix Technologies • Models are evolving rapidly, even the same- model is going through incremental changes • Different versions, flavors (s/m/l/xl), image sizes, are added • New operators are introduced regularly • Flexible architecture is a MUST • Dedicated ASICs cannot chase a moving target Fast model evolution – Flexibility is key YOLOv5 Changes V1 May, 2020 Initial Release V2 July, 2020 LeakyReLU(0.1) V3 August 2020 HardSwish activations added V4 January 2021 siLU() replaces leakyReLU and hardswish V5 April 2021 Added P6 (1280x1280) models V6 October 2021 SPP -> SPPF, C3 changes 10
  11. 11. © 2022 Flex Logix Technologies • Graph streaming reduces DRAM requirements, but BW matching is difficult • Each operator requires different amount of compute • How to maintain efficiency with load imbalance? Load-balancing difficult for streaming architectures 0 0.5 1 1.5 2 2.5 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91 94 97 100 103 106 109 TeraOps per Layer Model Layer 11
  12. 12. © 2022 Flex Logix Technologies Efficient data access: Each 1D TPU core can stream data from: • Neighboring 1D TPU (dedicated) • Any 1D TPU (via XFLX) • L2 SRAM (via XFLX) • DDR (via NoC) Flexible control and data path: • “Future proof” compute, activations, and generic operators via EFLX Dynamic TPU: Flexible, balanced & memory-efficient 12
  13. 13. © 2022 Flex Logix Technologies • Stream data at a sub-graph level with efficient BW matching Layer fusion – Match workloads at sublayer level 0 0.5 1 1.5 2 2.5 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91 94 97 100 103 106 109 TeraOps per Layer L2 1DTPU + 1DTPU L2 1DTPU 1DTPU L2 L2 1DTPU + 1DTPU L2 1DTPU 1DTPU L2 Input Memory MACS Misc Ops Output Memory L2 1DTPU + 1DTPU L2 1DTPU 1DTPU L2 1DTPU + 1DTPU L2 1DTPU 1DTPU + 1DTPU 1DTPU + 1DTPU 1DTPU 1DTPU 1DTPU 13
  14. 14. © 2022 Flex Logix Technologies Comparing GPU to our dynamic TPU GPU Dynamic TPU Lots of MACs Fewer MACs, but more efficiently used Computes predominantly via GDDR Compute via local connections, XFLX connections, flexible L2s, and DDR Lots of SW but hard to achieve efficiency Easy to achieve efficiency with X1 XDK Many GDDR Memory (256-bit) Few LPDDR (32-bit) 75 – 300 W (typ.) 6 – 10 W (typ.) Flexible Equally flexible via EFLX & XFLX Large, expensive & brute-force Small, low-cost & efficient 14
  15. 15. © 2022 Flex Logix Technologies Embedded Application + X1 X1 drivers X1 Runtime Embedded App at The Edge Runtime API Neural Net Conv AvgPool MaxPool Concat .ncf InferXDK Software Development Toolkit System level view of InferXDK software 15 NN Model Framework
  16. 16. © 2022 Flex Logix Technologies 2019: Xin Feng, Computer vision algorithms and hardware implementations: A survey • X1 provides ASIC performance/efficiency with flexibility of software • InferX SDK directly converts neural network graph model to dynamic InferX hardware instance • Much more flexible & future proof vs ASIC solutions • Much higher efficiency (Inf/W & Inf/$) vs CPU and GPU based solution • Thus enabling compact form factors such as M.2 2280 B+M >2x efficiency Superior performance with SW flexibility InferX
  17. 17. © 2022 Flex Logix Technologies • Putting TeraOps of performance in low power edge devices is today’s challenge • InferX was designed from scratch to solve this problem • > 10x improvement in efficiency versus comparable GPUs • Smart architecture reduces memory bandwidth and capacity requirements • While supporting complex models at high throughput • Designed to fit in small and low power system form factors Come visit us in the expo hall for demonstrations for InferX technology Conclusions 17
  18. 18. Thank you!

×