TFLite NNAPI and GPU Delegates

Koan-Sin Tan
freedom@computer.org

Aug 18th, 2019
COSCUP 2019, Taipei, Taiwan
TensorFlow is the most popular machine learning framework nowadays. TensorFlow Lite (TFLite), open sourced in late 2017, is TensorFlow’s runtime designed for mobile devices, especially Android cell phones. TFLite is getting more and more mature. Among the most interesting new components introduced recently are its GPU delegate and the new NNAPI delegate. The GPU delegate uses OpenGL ES compute shaders on Android platforms and Metal shaders on iOS devices. The original NNAPI delegate was an all-or-nothing design: if one of the ops in the compute graph was not supported by NNAPI, the whole graph was not delegated. The new one is a per-op design: when an op in a graph is not supported by NNAPI, that op automatically falls back to the CPU runtime. I’ll give a quick review of TFLite and its interpreter, then walk the audience through example usage of the two delegates and the important parts of their source code.

1. TFLite NNAPI and GPU Delegates
   Koan-Sin Tan, freedom@computer.org
   Aug 18th, 2019, COSCUP 2019, Taipei, Taiwan
2. • disclaimer: Opinions Are My Own
   • feel free to interrupt me if you have any questions
   • questions in English, Taiwanese, and Mandarin are fine
   • note that I am gonna skip memory-related code in the talk because of time constraints. Memory management, including locality and zero-copy, is always a crucial part of high-performance computing
3. who i am
   • Used open source before the term “open source” was coined
   • A software guy; learned to use Unix and open source software on a VAX-11/780 running 4.3BSD
   • Used to be a programming language junkie
   • Worked on various system software, e.g., CPU scheduling and power management of non-CPU components
   • Recently, working on NN performance on edge devices and related stuff
   • Contributed from time to time to TensorFlow Lite
     • started a command-line label_image for TFLite
   https://github.com/tensorflow/tensorflow/releases/tag/v2.0.0-alpha0
   http://gunkies.org/w/images/c/c1/DEC-VAX-11-780.jpg
4. Delegation
   • Delegation: one of the commonly used old mechanisms mentioned in the GoF book
   • presumably, you know this well already
   • in case not, delegate definitions from dictionaries work
   figure from GoF, https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/ch01.html#ch01lev3sec4
5. So, what is a TFLite delegate?
   • “A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.”
   • Why delegates?
     • running computation-intensive NN models on mobile devices is resource demanding; for mobile CPUs, processing power and energy consumption could be problems
     • and matrix multiplication, which is the core of convolution and fully connected ops, is highly parallel
     • Thus, some devices have hardware accelerators, such as a GPU or DSP, that provide better performance and higher energy efficiency thru Android NNAPI
     • To use NNAPI, TFLite has an NNAPI delegate
   • Why I want to share what I know
     • used TFLite, contributed some code, e.g., label_image for TFLite
     • wrote quick-and-dirty TFLite GPU delegate benchmarks
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
6. What is TFLite
   • A lightweight inference engine
     • originally for Android and similar platforms; extended to micro-controllers (e.g., ARM Cortex-M series)
   • Interpreter-based (what other choices do they have?)
     • ops are organized as a directed acyclic graph (DAG)
     • execute / interpret ops one by one if no delegates are involved
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L734-L798
7. TfLiteContext
   • TfLiteContext: reporting facilities and access to global objects, including all the tensors
   • TfLiteNode: a single node or operation
   • TfLiteRegistration: the implementation of a conceptual operation
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L411-L485
   (TfLiteContext members shown on the slide: ResizeTensor(), ReportError(), AddTensors(), GetNodeAndRegistration(), ReplaceNodeSubsetsWithDelegateKernels(), GetExternalContext(), SetExternalContext(), …; tensors_size, tensors, impl_, recommended_num_threads, allow_fp32_relax_to_fp16, profiler, …)
8. TfLiteNode
   • TfLiteContext: reporting facilities and access to global objects, including all the tensors
   • TfLiteNode: a single node or operation
   • TfLiteRegistration: the implementation of a conceptual operation
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L377-L409
   (TfLiteNode members shown on the slide: inputs, outputs, intermediates, temporaries, user_data, builtin_data, custom_initial_data, custom_initial_data_size, delegate, …)
9. TfLiteRegistration
   • TfLiteContext: reporting facilities and access to global objects, including all the tensors
   • TfLiteNode: a single node or operation
   • TfLiteRegistration: the implementation of a conceptual operation
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L487-L544
   (TfLiteRegistration members shown on the slide: init(), free(), prepare(), invoke(), profiling_string(), …; builtin_code, custom_name, version, …)
10. To know more
   • Reading [1][2] and creating a custom op will help in understanding TfLiteRegistration, TfLiteNode, and TfLiteContext more deeply
   [1] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/inference.md#write-a-custom-operator
   [2] https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/guide/ops_custom.md
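To make that contract concrete, here is a minimal sketch of a hypothetical custom op; the op itself (an identity/copy named MY_CUSTOM_OP) and its helper names are illustrative only, but the four function pointers mirror the TfLiteRegistration fields listed on slide 9:

    #include <cstring>
    #include "tensorflow/lite/c/c_api_internal.h"

    namespace my_custom_op {

    // init(): called once per node; parse custom options, allocate per-node state.
    void* Init(TfLiteContext* context, const char* buffer, size_t length) {
      return nullptr;  // no per-node state in this sketch
    }

    // free(): release whatever Init() allocated.
    void Free(TfLiteContext* context, void* buffer) {}

    // prepare(): tensor shapes are known here; resize the output accordingly.
    TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      TfLiteIntArray* output_size = TfLiteIntArrayCopy(input->dims);
      return context->ResizeTensor(context, output, output_size);
    }

    // invoke(): the actual computation; a plain copy in this sketch.
    TfLiteStatus Invoke(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      std::memcpy(output->data.raw, input->data.raw, input->bytes);
      return kTfLiteOk;
    }

    }  // namespace my_custom_op

    // Bundle the callbacks into a TfLiteRegistration, as custom ops do.
    TfLiteRegistration* Register_MY_CUSTOM_OP() {
      static TfLiteRegistration r = {my_custom_op::Init, my_custom_op::Free,
                                     my_custom_op::Prepare, my_custom_op::Invoke};
      return &r;
    }

Registering it with a resolver, e.g. resolver.AddCustom("MyCustomOp", Register_MY_CUSTOM_OP()), is what lets the interpreter find it; delegates reuse exactly the same TfLiteRegistration machinery for their kernels.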
11. TfLiteDelegate: the interface
   • In case you didn’t notice it yet, TFLite is mainly written in C++
   • C API for FFI from other high-level languages
     • I hacked a Smalltalk one
   • many classes are structs with no member functions so that they can be used easily from the C API
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/c/c_api_internal.h#L563-L602
   (TfLiteDelegate members shown on the slide: Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), …; data_, flags, …)
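From the C++ side, any of these delegates is applied to an interpreter the same way; a minimal sketch, where CreateSomeDelegate() is a placeholder for whichever factory function (NNAPI, GPU, …) is discussed later and model_path is your own model file:

    #include <memory>
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    // Placeholder: stands for any delegate factory (NNAPI, GPU GL, Metal, ...).
    TfLiteDelegate* CreateSomeDelegate();

    void RunWithDelegate(const char* model_path) {
      auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);

      // ModifyGraphWithDelegate() calls the delegate's Prepare(), which decides
      // which nodes to claim and replaces them with delegate kernels.
      TfLiteDelegate* delegate = CreateSomeDelegate();
      interpreter->ModifyGraphWithDelegate(delegate);

      interpreter->AllocateTensors();
      // ... fill input tensors, then:
      interpreter->Invoke();
    }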
12. How TFLite delegates work?
   • Let's say we have a simple model graph such as the following:
   • Let's assume that there is a delegate "MyDelegate," which has a faster implementation for Conv2D and Mean operations. The resulting main graph will be updated to look like below.
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/g3doc/performance/delegates.md
13. What does a real model look like?
   • With the NNAPI delegate rewrite back in Nov. 2018, a subgraph delegated to an “accelerator” is an op (named Delegate) in TFLite now
   • subgraph
   • all-or-nothing -> per op
   [Figures: MobileNet V1 visualized twice. The original graph runs from the 1×224×224×3 input through the usual Conv2D / DepthwiseConv2D / AveragePool2D / Squeeze / Softmax ops down to the 1×1001 output; in the delegated graph, everything between the input and Reshape_1 is collapsed into a single TfLiteNnapiDelegate node.]
   http://localhost:8080/, http://localhost:8090/
14. delegates in TFLite
   • NNAPI delegate
     • mainly for Android
   • GPU delegate
     • NNAPI, which was introduced in Android O MR1 (late 2017), is not popular (yet)
     • GL ES compute shader on Android
     • Metal shader on iOS
   • FlexDelegate: eager mode to run some ops
     • useful when not all ops are supported by TFLite or accelerators (thru something like NNAPI or GPU delegate)
   • not in the TensorFlow repo: EdgeTPU delegate
15. NNAPI-enabled devices: ~25.8% around May 7, 2019
   https://developer.android.com/about/dashboards
16. GL ES compute shader capable devices: ~50%
   https://developer.android.com/about/dashboards
17. Android NN API
   • Announced/published with Android 8.1 Preview 1
   • Available to developers in the NDK
     • yes, the NDK
   • The Android Neural Networks API (NNAPI) is an Android C API designed for running computationally intensive operations for machine learning on mobile devices
   • NNAPI is designed to provide a base layer of functionality for higher-level machine learning frameworks (such as TensorFlow Lite, Caffe2, or others) that build and train neural networks
   • The API is available on all devices running Android 8.1 (API level 27) or higher
   https://developer.android.com/ndk/images/nnapi/nnapi_architecture.png
18. So, what a delegate is supposed to implement
   • Understanding how to add a delegate helps
     • define a kernel node, which means implementing a TfLiteRegistration
     • create an instance of TfLiteDelegate, then register the kernel node in Prepare()

   typedef struct TfLiteDelegate {
     void* data_;
     TfLiteStatus (*Prepare)(TfLiteContext* context, struct TfLiteDelegate* delegate);
     TfLiteStatus (*CopyFromBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate,
                                          TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor);
     TfLiteStatus (*CopyToBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate,
                                        TfLiteBufferHandle buffer_handle, TfLiteTensor* tensor);
     void (*FreeBufferHandle)(TfLiteContext* context, struct TfLiteDelegate* delegate,
                              TfLiteBufferHandle* handle);
     int64_t flags;
   } TfLiteDelegate;

   typedef struct _TfLiteRegistration {
     void* (*init)(TfLiteContext* context, const char* buffer, size_t length);
     void (*free)(TfLiteContext* context, void* buffer);
     TfLiteStatus (*prepare)(TfLiteContext* context, TfLiteNode* node);
     TfLiteStatus (*invoke)(TfLiteContext* context, TfLiteNode* node);
     const char* (*profiling_string)(const TfLiteContext* context, const TfLiteNode* node);
     int32_t builtin_code;
     const char* custom_name;
     int version;
   } TfLiteRegistration;
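The Prepare() half of that contract boils down to telling TFLite which nodes the delegate claims. A minimal sketch of such a Prepare(), where IsNodeSupported() and GetMyDelegateKernelRegistration() are placeholders for delegate-specific logic rather than TFLite APIs:

    #include <algorithm>
    #include <vector>
    #include "tensorflow/lite/c/c_api_internal.h"

    // Placeholders for delegate-specific pieces (not part of TFLite):
    bool IsNodeSupported(TfLiteContext* context, TfLiteNode* node,
                         TfLiteRegistration* registration);
    TfLiteRegistration GetMyDelegateKernelRegistration();

    // Collect the nodes this delegate can handle and ask TFLite to replace
    // them with delegate kernels.
    TfLiteStatus MyDelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
      TfLiteIntArray* plan = nullptr;
      TF_LITE_ENSURE_STATUS(context->GetExecutionPlan(context, &plan));

      std::vector<int> supported_nodes;
      for (int i = 0; i < plan->size; ++i) {
        const int node_index = plan->data[i];
        TfLiteNode* node = nullptr;
        TfLiteRegistration* registration = nullptr;
        TF_LITE_ENSURE_STATUS(context->GetNodeAndRegistration(
            context, node_index, &node, &registration));
        if (IsNodeSupported(context, node, registration)) {
          supported_nodes.push_back(node_index);
        }
      }

      // TFLite partitions these into independent node subsets and creates one
      // delegate kernel (init/prepare/invoke) per subset.
      TfLiteIntArray* nodes_to_replace =
          TfLiteIntArrayCreate(static_cast<int>(supported_nodes.size()));
      std::copy(supported_nodes.begin(), supported_nodes.end(), nodes_to_replace->data);
      const TfLiteStatus status = context->ReplaceNodeSubsetsWithDelegateKernels(
          context, GetMyDelegateKernelRegistration(), nodes_to_replace, delegate);
      TfLiteIntArrayFree(nodes_to_replace);
      return status;
    }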
19. NNAPI delegate
   • C++ code, instead of a C-style one
   • derived from TfLiteDelegate
   • Some private data structures
   • extra member functions corresponding to the private data structures
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.h#L29-L161
   (StatefulNnApiDelegate members shown on the slide: Prepare(), CopyFromBufferHandle(), CopyToBufferHandle(), FreeBufferHandle(), GetOptions(), RegisterNnapiMemory(), GetTensorMemoryMap(), …; data_, flags, accelerator_name (options), (memory_registration), …)
20. data
   • execution_preference
     • power/perf tradeoff: not widely supported as far as I can tell
   • accelerator_name: e.g., “fallback” and “hvx”
   • cache_dir
   • model_token
   • tensor_memory_map: MemoryRegistration

   struct Data {
     // Preferred Power/perf trade-off.
     Options::ExecutionPreference execution_preference;
     // Selected NNAPI accelerator name.
     std::string accelerator_name;
     // The cache dir for NNAPI model.
     std::string cache_dir;
     // The unique token string for NNAPI model.
     std::string model_token;
     // Tensor to ANeuralNetworksMemory mapping.
     std::vector<MemoryRegistration> tensor_memory_map;
   };

   // Encapsulates all fields related to memory registration for internal
   // bookkeeping only.
   struct MemoryRegistration {
     ANeuralNetworksMemory* memory;
     CopyToHostTensorFnPtr callback;
     void* callback_context;
   };
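These fields mirror StatefulNnApiDelegate::Options, which is how an application picks an accelerator by name; a minimal sketch assuming the r2.0 Options layout (the accelerator name "qti-dsp" is only an example; query the device for the real names):

    #include <memory>
    #include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"

    // Sketch: ask NNAPI (1.2+) for a specific accelerator and a power/perf
    // preference, then hand supported nodes to it.
    std::unique_ptr<tflite::StatefulNnApiDelegate> MakeNnapiDelegate() {
      tflite::StatefulNnApiDelegate::Options options;
      options.execution_preference =
          tflite::StatefulNnApiDelegate::Options::kSustainedSpeed;
      options.accelerator_name = "qti-dsp";  // example name only
      return std::make_unique<tflite::StatefulNnApiDelegate>(options);
    }

    // Usage:
    //   auto nnapi_delegate = MakeNnapiDelegate();
    //   interpreter->ModifyGraphWithDelegate(nnapi_delegate.get());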
21. TfLiteRegistration for nnapi_delegate_kernel
   • init()
   • free()
   • prepare()
   • invoke()
   • no profiling_string()
   • builtin_code = …
   • custom_name
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3575-L3607
22. Init() of NNAPI Delegate Kernel
   • mainly for NNAPI initialization: ANeuralNetworksCompilation_*()
   • and building the graph
   • if NNAPI >= 1.2, checking that there is a “real” NNAPI device
   • one interesting conversion is INT8 -> UINT8
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2571-L2672
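For orientation, the raw NNAPI calls that this Init() stage revolves around follow the usual model-then-compilation flow; a bare sketch of that NNAPI sequence (not the actual nnapi_delegate.cc code; operand/operation setup and error handling are elided):

    #include <android/NeuralNetworks.h>

    // Build an NNAPI model mirroring the delegated TFLite subgraph,
    // then compile it for the selected device(s).
    void BuildAndCompile(ANeuralNetworksModel** model,
                         ANeuralNetworksCompilation** compilation) {
      ANeuralNetworksModel_create(model);

      // Describe operands and operations of the subgraph, e.g.:
      //   ANeuralNetworksModel_addOperand(*model, &operand_type);
      //   ANeuralNetworksModel_addOperation(*model, ANEURALNETWORKS_CONV_2D, ...);
      //   ANeuralNetworksModel_identifyInputsAndOutputs(*model, ...);
      ANeuralNetworksModel_finish(*model);

      ANeuralNetworksCompilation_create(*model, compilation);
      ANeuralNetworksCompilation_setPreference(
          *compilation, ANEURALNETWORKS_PREFER_SUSTAINED_SPEED);
      ANeuralNetworksCompilation_finish(*compilation);
    }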
23. INT8 -> UINT8 conversion
   • Original TFLite and NNAPI use asymmetric UINT8 quantization
     • the asymmetric scheme provides more flexibility, but usually symmetric INT8 is more hardware friendly
     • more and more INT8 code in TFLite
   • NNAPI doesn’t change as fast as TFLite, so conversion is needed
   • See the quantization paper for TFLite [1] and MLIR’s quantization doc [2]
   [1] Jacob, B. et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877
   [2] https://github.com/tensorflow/mlir/blob/master/g3doc/Quantization.md
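Both schemes use the same affine mapping real = scale * (q - zero_point), so moving an INT8 tensor to the UINT8 convention is just a shift of the quantized values and the zero point by 128; a small illustrative sketch of that arithmetic (not the delegate's actual code):

    #include <cstdint>

    // real = scale * (q - zero_point) for both INT8 and UINT8 schemes.
    // Shifting values and zero_point by +128 leaves real values unchanged:
    //   scale * ((q + 128) - (zp + 128)) == scale * (q - zp)
    struct QuantParams {
      float scale;
      int32_t zero_point;
    };

    inline uint8_t Int8ToUint8(int8_t q) {
      return static_cast<uint8_t>(static_cast<int32_t>(q) + 128);
    }

    inline QuantParams Int8ToUint8Params(const QuantParams& int8_params) {
      return {int8_params.scale, int8_params.zero_point + 128};
    }

    // Example: scale = 0.5, int8 zero_point = -3, q = 7
    //   int8:  0.5 * (7 - (-3))    = 5.0
    //   uint8: 0.5 * (135 - 125)   = 5.0  (values and zero point shifted by 128)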
24. Invoke() of NNAPI Delegate Kernel
   • mainly memory management and ANeuralNetworksExecution*()
   • To dig deeper we have to go thru more TFLite and NNAPI data structures
   • asking NNAPI to work for you is quite trivial when everything is well prepared
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L2683-L2872
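The execution side that Invoke() wraps is equally mechanical once a compilation exists; a bare NNAPI sketch (buffer pointers, sizes, and the single input/output are placeholders):

    #include <android/NeuralNetworks.h>
    #include <cstddef>

    // Bind input/output buffers to an execution, start it, and wait.
    void RunOnce(ANeuralNetworksCompilation* compilation,
                 const void* input, size_t input_bytes,
                 void* output, size_t output_bytes) {
      ANeuralNetworksExecution* execution = nullptr;
      ANeuralNetworksExecution_create(compilation, &execution);

      // Index 0 refers to the model's declared inputs/outputs.
      ANeuralNetworksExecution_setInput(execution, 0, nullptr, input, input_bytes);
      ANeuralNetworksExecution_setOutput(execution, 0, nullptr, output, output_bytes);

      ANeuralNetworksEvent* event = nullptr;
      ANeuralNetworksExecution_startCompute(execution, &event);
      ANeuralNetworksEvent_wait(event);

      ANeuralNetworksEvent_free(event);
      ANeuralNetworksExecution_free(execution);
    }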
25. DoPrepare
   • for NNAPI >= 1.2 (Android Q and later), if there are no real accelerators, i.e., only the NNAPI CPU fallback is there, computation is not offloaded
   • Check every node to see whether it is supported
   • NNAPI delegate registration: previous pages
   • Request TFLite to partition the graph and make the kernel for each independent node subset a new nnapi_delegate_kernel
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/nnapi/nnapi_delegate.cc#L3353-L3457
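The “is there a real accelerator?” check relies on NNAPI 1.2's device-enumeration API; a sketch of the idea (nnapi-reference is NNAPI's built-in CPU implementation, so it alone does not justify offloading):

    #include <android/NeuralNetworks.h>
    #include <cstring>

    // NNAPI 1.2 (API level 29) can enumerate devices; if the only device is
    // the "nnapi-reference" CPU implementation, the delegate can decline to
    // take any nodes and let TFLite's own CPU kernels run instead.
    bool HasRealNnapiAccelerator() {
      uint32_t device_count = 0;
      if (ANeuralNetworks_getDeviceCount(&device_count) != ANEURALNETWORKS_NO_ERROR) {
        return false;
      }
      for (uint32_t i = 0; i < device_count; ++i) {
        ANeuralNetworksDevice* device = nullptr;
        const char* name = nullptr;
        ANeuralNetworks_getDevice(i, &device);
        ANeuralNetworksDevice_getName(device, &name);
        if (name != nullptr && std::strcmp(name, "nnapi-reference") != 0) {
          return true;  // something other than the CPU fallback exists
        }
      }
      return false;
    }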
26. partition graph
   • at the end of DoPrepare(), ReplaceNodeSubsetsWithDelegateKernels() is called
   • DoPrepare() -> Subgraph::ReplaceNodeSubsetsWithDelegateKernels() -> tflite::PartitionGraphIntoIndependentNodeSubsets() -> tflite::Partition()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/core/subgraph.cc#L298-L363
27. tflite::Partition() does most of the partitioning work
   • part of Partition()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/graph_info.cc#L67-L118
28. GPU GL Delegate TfLiteRegistration
   • TfLiteRegistration in DelegatePrepare()
     • init()
     • no free()
     • prepare() is quite simple
     • invoke(): simply calls node->Invoke()
   • context->ReplaceNodeSubsetsWithDelegateKernels()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
29. GPU GL Delegate
   • TfLiteDelegate
     • Prepare
     • CopyFromBufferHandle
     • CopyToBufferHandle
   • class Delegate
   • TfLiteGpuDelegateCreate()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L75-L457
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L464-L470
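From application code this is a thin wrapper around the factory function; a minimal sketch, assuming (as I recall from gl_delegate.h in r2.0) that passing nullptr options selects the defaults:

    #include "tensorflow/lite/delegates/gpu/gl_delegate.h"
    #include "tensorflow/lite/interpreter.h"

    // Create the OpenGL ES compute-shader delegate and hand the graph to it.
    void UseGlDelegate(tflite::Interpreter* interpreter) {
      TfLiteDelegate* delegate = TfLiteGpuDelegateCreate(/*options=*/nullptr);
      if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
        // The graph stays on the CPU kernels if the delegate can't be applied.
      }
      interpreter->AllocateTensors();
      // ... run inference; once the interpreter itself has been destroyed:
      // TfLiteGpuDelegateDelete(delegate);
    }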
30. GPU Metal Delegate TfLiteRegistration
   • TfLiteRegistration in DelegatePrepare()
     • init()
     • no free()
     • prepare() is quite simple
     • invoke(): simply calls node->Invoke()
   • context->ReplaceNodeSubsetsWithDelegateKernels()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/gl_delegate.cc#L392-L431
31. GPU Metal Delegate
   • TfLiteDelegate
     • Prepare: yup, just Prepare()
   • class Delegate, which is quite large
   • NewGpuDelegate()
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L525-L532
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L620-L624
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/gpu/metal_delegate.mm#L163-L613
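On iOS the pattern mirrors the GL case, just with the Metal factory functions; a minimal sketch, assuming NewGpuDelegate()/DeleteGpuDelegate() are the functions exposed alongside metal_delegate.mm and that nullptr options means defaults:

    #include "tensorflow/lite/delegates/gpu/metal_delegate.h"
    #include "tensorflow/lite/interpreter.h"

    // Metal GPU delegate on iOS; same ModifyGraphWithDelegate() dance.
    void UseMetalDelegate(tflite::Interpreter* interpreter) {
      TfLiteDelegate* delegate = NewGpuDelegate(/*options=*/nullptr);
      if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
        // Fall back to the CPU kernels.
      }
      interpreter->AllocateTensors();
      // ... run inference; after the interpreter is destroyed:
      // DeleteGpuDelegate(delegate);
    }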
32. GPU delegate kernels
   • GPU backends require initialization involving shader compilation and optimization by the driver before inference
   • PHWC4: P stands for plane
     • Reshape is expensive on GPU
     • RGBA is better than RGB on GPU
     • a tensor of shape [B,H,W,5], for instance, is twice as expensive as [B,H,W,4], but about the same as [B,H,W,8]; the architect can then tune around those 4-channel boundaries rather than trying to optimize on other boundaries
   • https://arxiv.org/pdf/1907.01989.pdf
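The 4-channel boundary comes from packing channels into RGBA texels: PHWC4 stores ceil(C/4) 4-channel slices of H×W each, so the cost grows in steps of four channels. A tiny sketch of that arithmetic:

    #include <cstdio>

    // PHWC4 packs channels into 4-channel (RGBA) slices, so the GPU effectively
    // works on ceil(C / 4) planes of shape H x W x 4.
    int Phwc4Planes(int channels) { return (channels + 3) / 4; }

    int Phwc4Elements(int b, int h, int w, int c) {
      return b * h * w * Phwc4Planes(c) * 4;
    }

    int main() {
      // C=5 needs 2 planes, the same as C=8, and twice the cost of C=4.
      std::printf("C=4 -> %d plane(s)\n", Phwc4Planes(4));  // 1
      std::printf("C=5 -> %d plane(s)\n", Phwc4Planes(5));  // 2
      std::printf("C=8 -> %d plane(s)\n", Phwc4Planes(8));  // 2
      return 0;
    }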
33. Flex Delegate
   • Another delegate is the one that provides a selected set of TensorFlow ops in Eager mode
   • It’s much easier to check what it does
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/delegate.cc#L143-L148
   https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/lite/delegates/flex/kernel.cc#L561-L573
34. Edge TPU’s canned models
   • supported ops are packed into a single op for the Edge TPU
   • “The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same”
   https://coral.withgoogle.com/docs/edgetpu/models-intro/
   [Figures: MobileNet V1 compiled into a single edgetpu-custom-op between the 1×224×224×3 input and the 1×1001 Softmax output; SSD MobileNet V1 compiled into an edgetpu-custom-op followed by TFLite_Detection_PostProcess, which produces the four detection outputs]
35. Edge TPU C++ API
   https://coral.withgoogle.com/docs/edgetpu/api-intro/
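Per the Coral API introduction linked above, running a canned model from C++ means registering the edgetpu-custom-op kernel and attaching an opened Edge TPU device to the interpreter; a hedged sketch of that flow (the exact calls follow my reading of those docs and should be checked against them):

    #include <memory>
    #include "edgetpu.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    // Build an interpreter able to run the single edgetpu-custom-op produced
    // by the Edge TPU compiler.
    std::unique_ptr<tflite::Interpreter> BuildEdgeTpuInterpreter(
        const tflite::FlatBufferModel& model,
        edgetpu::EdgeTpuContext* edgetpu_context) {
      tflite::ops::builtin::BuiltinOpResolver resolver;
      // The canned model contains one custom op; register its kernel.
      resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());

      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(model, resolver)(&interpreter);

      // Bind the opened Edge TPU device to this interpreter.
      interpreter->SetExternalContext(kTfLiteEdgeTpuContext, edgetpu_context);
      interpreter->AllocateTensors();
      return interpreter;
    }

    // Usage:
    //   auto tpu_context = edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();
    //   auto interpreter = BuildEdgeTpuInterpreter(*model, tpu_context.get());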
36. EdgeTPU Delegate
   • There is a dynamic delegate plugin interface; currently it’s only used by the EdgeTPU delegate
   https://coral.withgoogle.com/docs/edgetpu/api-intro/
37. There still are many trivial bugs in TensorFlow
   • There are many typos in comments in TensorFlow code
   • Many things are not well documented
   • There are many, many warnings when building TensorFlow from source code
   • a trivial fix in May 2019 by me
   https://github.com/tensorflow/tensorflow/pull/28618
