The Cutting-Edge AI Deployment Tech You Want Is All Here!

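You have to hand it to NVIDIA: it keeps surprising us. As a deep learning practitioner and a gamer, I (Lao Pan) have always been interested in anything related to AI, GPUs, and parallel computing. NVIDIA is the leading hardware platform for deep learning, so we know it about as well as anything.

[Figure: NVIDIA AI]

NVIDIA's hardware is excellent, but it takes enough software developers behind it to really make it thrive, and NVIDIA clearly cares about this. After all, only when everyone uses it and finds it easy to use, you included, will you buy NVIDIA products and use their GPUs for the things you are interested in (and NVIDIA gets to make money, of course).

That is what an ecosystem is.

Having used NVIDIA products (graphics cards and the embedded series) for quite a while, I find NVIDIA's software support for this hardware friendly and open.

TensorRT, which we use every day, and toolkits such as DeepStream, TLT (Transfer Learning Toolkit), and Triton Inference Server are all provided by NVIDIA ready to use out of the box, and they really are good. My only complaint is that not all of it is fully open source (go easy on me~).

[Figure: NVIDIA TLT]

Here are a few places that showcase NVIDIA's latest research projects (some already open source, some yet to be), great for finding inspiration: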

https://www.nvidia.com/en-us/on-demand/
https://www.nvidia.com/en-us/research/ai-playground/
https://developer.nvidia.com/transfer-learning-toolkit

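NVIDIA: What You Want

Most of NVIDIA's new technologies, including those due to be open-sourced within the next six months, are showcased on a site called On-Demand. Why "On-Demand"? The dictionary gloss is roughly "available as requested". In other words, whatever technology you want or are curious about is probably here!

[Figure: NVIDIA On-Demand]

It is worth browsing from time to time. For anyone who wants to keep up with the latest in deep learning deployment, acceleration, and new techniques, it is a great place to learn. The On-Demand talks fall roughly into these categories:

Autonomous driving, robotics
Big data, networking, visualization
Data science
Deep learning
GPU programming
Graphics and design
High-performance computing
Simulation, modeling

My own focus is of course deep learning, AI, and inference acceleration, so I collected the slides from recent NVIDIA talks on deployment (TensorRT, ONNX Runtime, sparsity, quantization, Polygraphy) along with short introductions. Every talk comes with slides and a video, so there is plenty of material.

[Figure: list of NVIDIA On-Demand slide decks]

I have covered some of these technologies before.

After all that preamble, the purpose of this post is to introduce some of NVIDIA's new technologies from the last six months, plus some useful background knowledge. I found them practical and rewarding, so I am sharing them here.

Each heading below is the title of a talk, which you can find on On-Demand along with its slides and video. Slides can be downloaded individually (an account is required); if you would rather not, I have also collected them at the end of this post.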

TensorRT Quick Start Guide

We’ll walk you through the TensorRT Quick Start Guide. The newly-published TensorRT Quick Start Guide provides a quick introduction to new users starting out with TensorRT. It includes Jupyter notebooks and C++ examples of the most common TensorRT workflows and examples for using TensorRT with TensorFlow, PyTorch, and ONNX.

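This is the official TensorRT getting-started guide: authoritative and to the point. It includes notebooks and plenty of C++ samples, with examples of using TensorRT together with PyTorch, TensorFlow, and ONNX. It is ideal for newcomers.

I have also written an introduction to TensorRT. If you would rather not read the English guide, see: "Caught in the rat race and still don't know TensorRT? A super-detailed getting-started guide".

TensorRT moves fast. While I was writing about TensorRT 7.2.3.4, TensorRT 8 EA had quietly been released. As I mentioned before, that was the EA (early access) build, and the GA release followed just two weeks later. The current version is TensorRT 8 GA, which improves on the previous version in both performance and ease of use.

Naturally, NVIDIA has its own walkthrough of what improved and how to enable the new features.

Here it is.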

Accelerate Deep Learning Inference with TensorRT 8.0

TensorRT is an SDK for high-performance deep learning inference used in production to minimize latency and maximize throughput. The upcoming TensorRT 8.0 release provides features such as sparsity optimized for NVIDIA Ampere GPUs, quantization-aware training, and enhanced compiler to accelerate transformer-based networks. Deep learning compilers need to have a robust method to import, optimize, and deploy models. New users can learn about the common workflow, while experienced users can learn more about new TensorRT 8.0 features.

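TensorRT 8 has finally been released, and NVIDIA promoted it heavily, with a blog post, slides, and an accompanying talk.

[Figure: TensorRT]

The current GA release is TensorRT 8.0.1.6. Compared with version 7, version 8 has three major changes:

Support for QAT (quantization-aware training): models trained with QAT in other frameworks can be imported directly into TensorRT
Support for sparse networks on Ampere-architecture GPUs, which can raise throughput by 50%
Better optimization for transformer-based networks such as BERT

TensorRT 8 changes quite a lot, as you would expect from a major release. Start with the slides from this talk for the details. I will write a detailed introduction later (yet another promise, hehe).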

Introduction to TensorRT and Triton: A Walkthrough of Optimizing Your First Deep Learning Inference Model

NVIDIA TensorRT is a deep learning platform that optimizes neural network models and speeds up inference across GPU-accelerated platforms running in the data center and embedded devices. We’ll provide an overview of TensorRT, show how to optimize a PyTorch model, and demonstrate how to deploy this highly optimized model using NVIDIA Triton Inference Server. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.

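If TensorRT is an excellent inference framework, Triton is an equally excellent serving framework.

I have a TensorRT! You have a Triton! Put them together and you get Triton with TensorRT! The combination is arguably the strongest open-source inference serving solution.

Triton really is a pleasure to use. Its feature set is on par with other serving frameworks, and its backends include TensorRT, ONNX Runtime, libtorch, TensorFlow, PyTorch, and OpenVINO. It supports the HTTP and gRPC protocols (and since it is open source, you can add your own protocol), multiple GPUs, multiple model instances, and hot reloading of models.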
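To make the HTTP interface a little more concrete, here is a minimal sketch of the JSON body that Triton's HTTP inference endpoint accepts (it follows the KServe v2 inference protocol). The model name `resnet50`, the tensor name `input`, the shape, and the server address are assumptions for illustration only, and the request is only built here, not sent.

```python
import json


def build_infer_request(input_name, shape, datatype, flat_data):
    """Build the JSON body for POST /v2/models/<model>/infer."""
    expected = 1
    for dim in shape:
        expected *= dim
    if expected != len(flat_data):
        raise ValueError("data length does not match the declared shape")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(flat_data),  # row-major, flattened tensor
            }
        ]
    }


# Hypothetical model and tensor names; 8000 is Triton's default HTTP port.
model_name = "resnet50"
url = f"http://localhost:8000/v2/models/{model_name}/infer"
body = build_infer_request("input", (1, 4), "FP32", [0.1, 0.2, 0.3, 0.4])
payload = json.dumps(body)

print(url)
print(payload)
```

In practice you would send `payload` with any HTTP client, or use NVIDIA's `tritonclient` package, which wraps the same protocol.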

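Here are the main features again. It is now fully on par with the long-established TensorFlow Serving:

[Figure: Triton server features]

Features of the latest Triton release, 21.06:

[Figure: new Triton features]

Since Triton server is NVIDIA's own product, its TensorRT backend is the most complete and efficient one. If you need to serve a TensorRT model, Triton server is the first choice.

The overall Triton workflow is also very simple. Once you are familiar with it, deploying models becomes very fast:

[Figure: Triton workflow]

In short, Triton is an excellent open-source serving framework and the first choice for serving TensorRT models, and the other backends plug in seamlessly as well.

I will also write more about Triton later.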

Quantization Aware Training in PyTorch with TensorRT 8.0

Quantization is used to improve latency and resource requirements of Deep Neural Networks during inference. Quantization Aware Training (QAT) improves accuracy of quantized networks by emulating quantization errors in the forward and backward passes during training. TensorRT 8.0 brings improved support for QAT with PyTorch, in conjunction with NVIDIA’s open-source pytorch-quantization toolkit. This session gives an overview of the improvements in QAT with TensorRT 8.0, and walks through an end-to-end usage example.

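Quantization-aware training (QAT) is a new feature in TensorRT 8. TensorRT 7 offered post-training quantization (PTQ), which calibrates the quantization scales on a subset of the dataset after training; QAT finds those scales more accurately.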
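As a rough illustration of what "calibrating on a subset of the dataset" means in PTQ, the sketch below derives a symmetric INT8 scale from a few calibration batches using simple max calibration. This is only one possible strategy and is an assumption here, not a description of TensorRT's own calibrators; the function name and sample values are hypothetical.

```python
def calibrate_scale(batches, num_bits=8):
    """Max calibration: map the largest observed |activation| to the top INT8 value."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for batch in batches for v in batch)
    if amax == 0:
        raise ValueError("calibration data is all zeros")
    return amax / qmax


# Hypothetical activations collected while running a few calibration batches
calibration_batches = [[0.5, -1.2, 0.3], [2.54, -0.7, 1.1]]
scale = calibrate_scale(calibration_batches)
print(scale)
```

The scale is fixed after calibration, which is exactly where PTQ differs from QAT: QAT keeps adjusting it while the network trains.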

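[Figure: the quantization-aware training process]

As the figure above shows, a quantization operator is inserted that quantizes the FP32 input tensor to INT8 and then dequantizes the INT8 tensor back to FP32. Training itself still runs in FP32, but the operator learns the quantization and dequantization scales during training, so the scales are more accurate than those from post-training calibration.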
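The quantize-then-dequantize round trip described above can be sketched in a few lines of plain Python. This is a simplified forward pass only, assuming symmetric per-tensor INT8 quantization with a fixed `scale`; in real QAT the scale is a trainable quantity, and the function name here is hypothetical.

```python
def fake_quantize(values, scale, num_bits=8):
    """Quantize FP32 values to the INT8 grid, then dequantize back to floats."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    restored = []
    for v in values:
        q = round(v / scale)           # quantize: nearest integer step
        q = max(-qmax, min(qmax, q))   # clamp to the symmetric INT8 range
        restored.append(q * scale)     # dequantize: back to a float
    return restored


scale = 0.02
inputs = [0.013, -0.5, 3.0]
restored = fake_quantize(inputs, scale)
errors = [abs(a - b) for a, b in zip(inputs, restored)]
print(restored)
print(errors)
```

Values inside the representable range come back with an error of at most half a step (`scale / 2`), while values beyond `127 * scale` are clipped. Exposing the network to exactly these errors during training is what lets it adapt to them.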

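[Figure: quantization workflow]

TensorRT 8 can directly load a model that was quantized with QAT and exported to ONNX, and NVIDIA also provides a companion PyTorch quantization toolkit, so the whole path is covered.

See the talk for the quantization details. I will also write an article on quantization later; one more promise~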

Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture

In this session, we’ll share details of Sparse Tensor Cores in the NVIDIA Ampere Architecture and the unique 2:4 sparse format they support. Learn how we’ve simplified maintaining accuracy when pruning all types of networks, including classification networks, language models, and GANs. Finally, find out how to accelerate your own workloads using Sparse Tensor Cores from start to finish with ASP and TensorRT 8.0 and cuSPARSELt.

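Sparsification is similar to pruning and quantization: it has a regularizing effect, reduces the number of model parameters, and can sometimes speed the model up. It works by setting part of the tensor values to zero, and unlike ordinary pruning, it needs special hardware support to deliver a speedup.

[Figure: sparse tensor inference workflow]

The idea of sparsity has been around for a long time, and CUDA gained support for it the year before last.

Some NVIDIA GPUs support sparse inference. On an NVIDIA A100 GPU running BERT, the sparsified network is 50% faster than the original dense network. Does your own card support it? Every Ampere-architecture GPU does (for example the 30-series cards).

On a 30-series card, the matrix we sparsified ahead of time is compressed into a matrix of the remaining nonzero values plus an index table, and the hardware then computes with it as usual. Because the number of parameters is clearly reduced, computation is considerably faster than before.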
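The "compressed matrix plus index table" idea can be sketched for a single weight row. The code below keeps the two largest-magnitude values in every group of four (the 2:4 pattern named in the talk abstract), stores them together with their positions, and computes a dot product from the compressed form. It is a plain-Python illustration under those assumptions, with hypothetical function names; it does not reflect the actual hardware storage layout.

```python
def compress_2_to_4(row):
    """Keep the 2 largest-magnitude values of each group of 4, plus their positions."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        top2 = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        for i in sorted(top2):
            values.append(group[i])   # compressed values: half the original size
            indices.append(i)         # index table: position (0-3) inside the group
    return values, indices


def sparse_dot(values, indices, x):
    """Dot product that touches only the kept weights."""
    total = 0.0
    for k, (v, idx) in enumerate(zip(values, indices)):
        group_start = (k // 2) * 4    # two kept values per group of four
        total += v * x[group_start + idx]
    return total


weights = [0.1, -2.0, 0.0, 3.0, 1.5, 0.2, -0.3, 4.0]
activations = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
values, indices = compress_2_to_4(weights)
result = sparse_dot(values, indices, activations)
print(values, indices, result)
```

Only half of the multiply-adds remain, which is where the speedup comes from once the hardware can consume the index table directly.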

[Figure: 2:4 structured sparse matrix]

[Figure: benefits of sparsity]

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Prototyping and Debugging Deep Learning Inference Models Using TensorRT’s ONNX-Graphsurgeon and Polygraphy Tools

Deep learning researchers and engineers usually have to spend a significant amount of time debugging accuracy and performance of their deep learning inference models before deploying them. TensorRT recently open-sourced some more tools to assist with the development and debugging of deep neural networks for inference. ONNX GraphSurgeon is a tool that allows you to easily generate new ONNX graphs, or modify existing ones. This can be useful in scenarios like using custom implementations for parts of the ONNX graph, in place of those provided by TensorRT. Polygraphy is a toolkit designed to assist in running and debugging deep learning models in various frameworks. It includes a Python API and several command-line tools built using this API. These tools allow displaying information about models, such as network structure; determining which layers of a TensorRT network need to be run in a higher precision for accuracy; and comparing inference results across frameworks, among other features.

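Polygraphy is a very powerful tool and I strongly recommend it; it may well cut your debugging time in half. It has not been promoted much so far, so many people do not know about it yet.

Here is what it can do:

[Figure: all Polygraphy features]

You can inspect the structure of ONNX and TRT networks, modify and simplify ONNX models, and track down what is wrong with a converted TRT model... In short, if you are a heavy TRT and ONNX user, do not miss this tool. If you work in a similar field, its design ideas are also worth borrowing!

A few example commands:

View the structure of an ONNX model:
polygraphy inspect model mymodel.onnx
View the structure of an engine:
polygraphy inspect model mytrt.trt --model-type engine
View the TRT network structure that would be generated from an ONNX model:
polygraphy inspect model mymodel.onnx --display-as=trt --mode basic
Compare the TRT and ONNX results. First save the ONNX Runtime outputs:
polygraphy run mymodel.onnx --onnxrt --save-outputs onnx_res.json
Then run the converted engine and compare against them:
polygraphy run mytrt.trt --model-type engine --trt --load-outputs onnx_res.json --abs 1e-4
Modify the ONNX structure:
polygraphy surgeon sanitize modele2-nms.onnx \
    --override-input-shapes input_name:[1,3,224,224] \
    -o modele2-nms-static-shape.onnx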
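The `--abs 1e-4` option in the comparison command above reads as an absolute tolerance on the difference between the two sets of outputs. The sketch below shows what such a check means conceptually. It is not Polygraphy's implementation; the function name and the sample output values are hypothetical.

```python
def compare_outputs(reference, candidate, atol=1e-4):
    """Return (passed, max_abs_diff) for two flat lists of output values."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    max_diff = max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
    return max_diff <= atol, max_diff


# Hypothetical outputs from the ONNX Runtime run and the TensorRT engine
onnx_out = [0.12345, -1.5, 3.0]
trt_out = [0.12349, -1.50002, 3.0]

passed, max_diff = compare_outputs(onnx_out, trt_out, atol=1e-4)
strict_passed, _ = compare_outputs(onnx_out, trt_out, atol=1e-5)
print(passed, max_diff, strict_passed)
```

Choosing the tolerance is a judgment call: FP16 or INT8 engines typically need a looser threshold than FP32 ones before a mismatch counts as a real bug.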

For more, see the slides, or wait for my upcoming write-up~

Achieve Best Inference Performance on NVIDIA GPUs by Combining TensorRT with TVM Compilation Using SageMaker Neo

Amazon SageMaker Neo allows customers to compile models from any framework for optimized inference on many compilation targets, including NVIDIA Jetson devices and T4 GPU instances. We’ll dive into the details of how Neo uses the open-source deep learning compiler TVM and NVIDIA TensorRT together to provide the best inference performance across popular deep learning model types.

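Combining TVM and TensorRT sounds powerful. Both are among the best inference acceleration frameworks in the industry, so what happens when you put them together?

I have mentioned TVM before: it is an outstanding deep learning compiler, and TensorRT needs no introduction. The combination works the way I expected, through integration or graph partitioning: part of the computation graph runs in TVM and part runs in TensorRT, each playing to its strengths.

[Figure: TensorRT + TVM]

Do not get your hopes too high, though; reality falls short of the ideal. I tried the combination on a few of my own cases (fairly complex models) and it did not work: the ops that TensorRT optimizes poorly or does not support were not supported by TVM either, and I had no time to write them myself, so I had to look for other approaches.

That is not the whole story, since everyone's model is different. If a few ops in your network are unsupported by TensorRT or by TVM, it is worth giving the combination a try first.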

New Features in TRTorch, a PyTorch/TorchScript Compiler Targeting NVIDIA GPUs Using TensorRT

We’ll cover new features of TRTorch, a compiler for PyTorch and TorchScript that optimizes deep learning models for inference on NVIDIA GPUs. Programs are internally optimized using TensorRT but maintain full compatibility with standard PyTorch or TorchScript code. This allows users to continue to feel like they’re writing PyTorch code in their inference applications while fully leveraging TensorRT. We’ll discuss new capabilities enabled in recent releases of TRTorch, including direct integration into PyTorch and post-training quantization.

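TRTorch struck me as an odd name at first. After a closer look, the library turns out to be quite practical in specific scenarios. The conversion path to TRT becomes:

PyTorch -> TorchScript -> TensorRT

So we have one more route for converting PyTorch models to TRT!

I have used torch2trt to convert PyTorch models to TRT before, so what does TRTorch offer me? Both go from PyTorch to TRT, so why not just use torch2trt instead of taking a detour through TorchScript?

TRTorch is a good fit for one scenario in particular: the PyTorch model contains ops that TensorRT does not support and that are hard to work around. In that case TRTorch can split the graph into subgraphs, so that part of it runs in TensorRT and the rest runs in libtorch.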
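The subgraph-splitting idea can be illustrated with a toy partitioner that walks a linear sequence of ops and groups consecutive ones by whether the accelerated backend supports them. This is a conceptual sketch only: real graphs are DAGs rather than flat lists, and the op names, the `supported` set, and the backend labels are all hypothetical rather than taken from TRTorch.

```python
def partition_ops(ops, supported):
    """Group consecutive ops into segments that run on the same backend."""
    segments = []
    for op in ops:
        backend = "tensorrt" if op in supported else "fallback"
        if segments and segments[-1][0] == backend:
            segments[-1][1].append(op)        # extend the current segment
        else:
            segments.append((backend, [op]))  # backend changed: start a new one
    return segments


# Hypothetical model with one op the accelerated backend cannot handle
ops = ["conv", "relu", "custom_nms", "conv", "softmax"]
supported = {"conv", "relu", "softmax"}

segments = partition_ops(ops, supported)
for backend, segment_ops in segments:
    print(backend, segment_ops)
```

Every boundary between segments implies a handoff of tensors between runtimes, so a model that alternates frequently between supported and unsupported ops may gain little from partitioning.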

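This looks a lot like the TVM + TensorRT approach above. Neither feature is very mature yet, though: there are plenty of bugs, and for now they are not very pleasant to use...

[Figure: four guidelines]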

Low-Latency, High-Throughput Inferencing for Transformer-Based Models

Transformer-based models provide state-of-the-art accuracy for many NLP tasks. Recent models contain a large number of parameters, which makes meeting low latency requirements challenging for online inferencing. We’ll cover highly optimized inferencing solutions for transformer-based models to tackle online and offline inferencing scenarios. We’ll demonstrate that low latency and high throughput can be achieved with the combination of NVIDIA hardware and software. We’ll briefly go over BERT inferencing with FasterTransformer, TensorRT, and MXNet, and also present performance data from the latest NVIDIA GPUs.

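Transformers need little introduction: so far they are the most useful combined encoder-decoder architecture. There are many transformer-based models, BERT being the best known. Beyond NLP, in any task that involves an encoding or decoding stage, a transformer is a go-to way to improve model accuracy. Transformers are accurate and map well onto GPU compute; their one drawback is that they are still not as fast as pure convolutional networks.

TensorRT 8 optimizes transformer structures more deeply and is worth trying:

[Figure: TensorRT 8 speedups for transformers]

NVIDIA's open-source transformer projects related to TensorRT: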

https://github.com/NVIDIA/FasterTransformer
https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT

Inference with Tensorflow 2 Integrated with TensorRT Session

Learn how to inference using Tensorflow 2 with TensorRT integrated and the performance this can offer. Tensorflow is a machine learning platform and TensorRT is an SDK for high-performance deep learning inference using NVIDIA GPUs. Tensorflow models are usually written in FP32 precision to work for both training and inference. Tensorflow-TensorRT integration automatically offloads portions of the Tensorflow graph to run with TensorRT using precisions FP16 or INT8 to improve inference throughput without sacrificing much accuracy. We’ll describe: how to use Tensorflow-TensorRT integration in Tensorflow 2; the dynamic shape feature we recently added to better handle Tensorflow graph with unknown shapes; the lazy calibration mode we recently added to improve the workflow for inferencing with INT8 precision; some details on how Tensorflow-TensorRT works; and the performance benefits of using Tensorflow-TensorRT for inference.

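[Figure: TF-TRT]

I am not very familiar with TensorFlow 2, so I will not say much here. For those who do use TensorFlow 2, TRT acceleration is now more convenient; see the slides for details.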

Designing and Optimizing Deep Neural Networks for High-Throughput and Low-Latency Production Deployment

When integrating DNNs into applications the project teams need to consider much more than just model accuracy. Factors such as throughput affect the size and the cost of the infrastructure required to host the application. Similarly, latency of model response is important for a wide range of time-sensitive application and a hard requirement when building safety-critical applications. We’ll discuss how to select efficient models that allow us to meet the throughput and latency requirements (including multitask DNNs) as well as key approaches for their further optimization, such as quantification-aware training, post-training quantification, pruning, distillation, and other forms of model compression. We’ll explain how those techniques interact with the GPU architecture. Finally, we’ll reprise key tools that can simplify the model optimization and deployment process, such as TensorRT or Triton Inference Server.

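This talk covers how to design and optimize high-throughput, low-latency production models, and involves TensorRT and Triton Inference Server.

Models keep getting bigger. There is no way around it: if you want high accuracy, you need a large model.

[Figure: ever-larger models are the trend of the future]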