TensorFlow Lite model optimization
Edge devices often have limited memory or computational power. Various optimizations can be applied to models so that they can run within these constraints (memory, power consumption, and model storage). In addition, some optimizations allow the use of specialized hardware for accelerated inference.
TensorFlow Lite and the TensorFlow Model Optimization Toolkit provide tools to minimize the complexity of optimizing inference.
Optimize machine learning models: https://tensorflow.google.cn/model_optimization
It’s recommended that you consider model optimization during your application development process. This document outlines some best practices for optimizing TensorFlow models for deployment to edge hardware.
1. Model optimization
There are several main ways model optimization can help with application development.
1.1 Size reduction
Some forms of optimization can be used to reduce the size of a model. Smaller models have the following benefits:
- Smaller storage size: Smaller models occupy less storage space on your users’ devices. For example, an Android app using a smaller model will take up less storage space on a user’s mobile device.
- Smaller download size: Smaller models require less time and bandwidth to download to users’ devices.
- Less memory usage: Smaller models use less RAM when they are run, which frees up memory for other parts of your application to use, and can translate to better performance and stability.
Quantization can reduce the size of a model in all of these cases, potentially at the expense of some accuracy. Pruning and clustering can reduce the size of a model for download by making it more easily compressible.
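For illustration, here is a minimal sketch of post-training dynamic range quantization using the standard tf.lite.TFLiteConverter API; the model and file paths are placeholders to substitute with your own.

```python
import tensorflow as tf

# Load a trained Keras model (placeholder path; substitute your own).
model = tf.keras.models.load_model("my_model.keras")

# tf.lite.Optimize.DEFAULT enables post-training dynamic range
# quantization: weights are stored as 8-bit integers, shrinking the
# file to roughly a quarter of its float32 size.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```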
1.2 Latency reduction
Latency is the amount of time it takes to run a single inference with a given model. Some forms of optimization can reduce the amount of computation required to run inference using a model, resulting in lower latency. Latency can also have an impact on power consumption.
Currently, quantization can be used to reduce latency by simplifying the calculations that occur during inference, potentially at the expense of some accuracy.
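For example, full integer quantization converts both weights and activations to 8-bit integers, which typically reduces inference latency on CPUs. A minimal sketch, assuming a trained Keras model and a representative dataset for calibrating activation ranges (the input shape and random stand-in data below are placeholders):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_model.keras")  # placeholder

def representative_dataset():
    # Yield ~100 samples shaped like the model's input so the converter
    # can calibrate activation ranges; random data is a stand-in here.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```

In practice, calibration should draw a few hundred samples from the training or validation data, since the computed quantization ranges directly affect accuracy.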
1.3 Accelerator compatibility
Some hardware accelerators, such as the Edge TPU, can run inference extremely fast with models that have been correctly optimized.
Generally, these types of devices require models to be quantized in a specific way. See each hardware accelerator’s documentation to learn more about their requirements.
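As a sketch of what such a requirement looks like in practice: the Edge TPU expects full integer quantization with integer inputs and outputs, which extends the previous example with a few extra converter settings (the model and calibration data are again placeholders):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_model.keras")  # placeholder

def representative_dataset():
    for _ in range(100):  # stand-in calibration data
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict conversion to int8 ops and make the model's inputs and
# outputs int8 as well, as required by the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting file can then be compiled for the Edge TPU, for example with Coral's edgetpu_compiler command-line tool.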
References
- TensorFlow Lite: Model optimization. https://tensorflow.google.cn/lite/performance/model_optimization
- Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," CVPR 2018.