Deploying a PyTorch neural network model with ONNX in C++: the full workflow

Introduction

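Open Neural Network Exchange (ONNX) is an open standard format for representing deep learning models, so that a model can be moved between frameworks: for example, converting a PyTorch model to a Caffe model, or calling a model trained in Python from C++.
As I understand it, ONNX defines a set of neural-network operators and stores the model as a computation graph. Each node records its operator type, its attributes (such as kernel size), and how its output feeds the next node, so a runtime can walk the graph and call the matching operator to run inference.
(I plan to study the source code in more detail when I have time.)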
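
To make the "graph of nodes plus a set of operators" idea concrete, here is a toy sketch in plain Python. It is not ONNX's real file format or API: the `Node` type, the `OPS` registry, and `run_graph` are hypothetical names invented for illustration, and the values are scalars rather than tensors.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    op_type: str       # which operator to dispatch to
    inputs: List[str]  # names of the values this node reads
    output: str        # name of the value this node writes


# The "runtime" side: one implementation per operator type (hypothetical registry).
OPS: Dict[str, Callable[..., float]] = {
    "Mul": lambda a, b: a * b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: max(a, 0.0),
}


def run_graph(nodes: List[Node], feeds: Dict[str, float]) -> Dict[str, float]:
    """Evaluate the nodes in order; the list is assumed to be topologically sorted."""
    values = dict(feeds)
    for node in nodes:
        args = [values[name] for name in node.inputs]
        values[node.output] = OPS[node.op_type](*args)
    return values


# The saved "model": y = relu(x * w + b), stored as a list of nodes.
graph = [
    Node("Mul", ["x", "w"], "xw"),
    Node("Add", ["xw", "b"], "z"),
    Node("Relu", ["z"], "y"),
]

result = run_graph(graph, {"x": 2.0, "w": 3.0, "b": 1.0})
print(result["y"])
```

The point of the design is that the graph is pure data, so any runtime that implements the same operators can execute it, regardless of which framework produced it.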

Steps

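My project required deploying a PyTorch model inside a C++ system, so I did this with the ONNX libraries and am recording the process here.

The implementation has two main parts:

In the Python environment, record the PyTorch model's inference as an ONNX computation graph and save it as a .onnx file.
In the C++ environment, load the saved .onnx file with the C++ onnxruntime library and run inference.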

python

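import torch
from network import network  # your own model class (an nn.Module); adjust to your project

# ================ Export ==========================
# Create dummy inputs. Only the shapes (and dtypes) need to match the real inputs,
# because the ONNX export records the computation graph, not the input values.
dummy_input1 = torch.randn(1, 1, 224, 224)
dummy_input2 = torch.randn(1, 1, 60, 60)
dummy_input3 = torch.randn(1, 1, 256)
# Instantiate the network; assume it has three inputs and three outputs
net = network()
net.eval()  # inference mode, so dropout / batch norm behave as at test time
# Export the ONNX model
torch.onnx.export(net,
                  (dummy_input1, dummy_input2, dummy_input3),
                  "net.onnx",
                  export_params=True,        # store the trained parameters in the model file
                  opset_version=10,          # ONNX operator set version
                  do_constant_folding=True,  # fold constant subexpressions (optimization)
                  input_names=['input0', 'input1', 'input2'],
                  output_names=['output0', 'output1', 'output2'])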

# ================ Verification ==============================
import onnxruntime
import numpy as np
onnx_session = onnxruntime.InferenceSession('net.onnx')
# Inference no longer goes through torch, so the inputs are NumPy arrays, not tensors.
# The exported model expects float32, while np.random.randn returns float64.
input0 = np.random.randn(1, 1, 224, 224).astype(np.float32)
input1 = np.random.randn(1, 1, 60, 60).astype(np.float32)
input2 = np.random.randn(1, 1, 256).astype(np.float32)
input_name0 = onnx_session.get_inputs()[0].name
input_name1 = onnx_session.get_inputs()[1].name
input_name2 = onnx_session.get_inputs()[2].name
output_name0 = onnx_session.get_outputs()[0].name
output_name1 = onnx_session.get_outputs()[1].name
output_name2 = onnx_session.get_outputs()[2].name
# Run inference with the ONNX model
res = onnx_session.run([output_name0, output_name1, output_name2],
                       {input_name0: input0,
                        input_name1: input1,
                        input_name2: input2})
output0 = res[0]
output1 = res[1]
output2 = res[2]
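
Running the exported model only shows that it executes. To check that it is correct, the same inputs can be fed to the PyTorch model and the two sets of outputs compared within a tolerance. Below is a minimal sketch of such a comparison using only the standard library on flattened outputs. The helper names, the tolerance values, and the sample numbers are all assumptions made for illustration. With NumPy arrays, `np.testing.assert_allclose` does the same job.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flattened outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  atol: float = 1e-5, rtol: float = 1e-3) -> bool:
    """True if every element satisfies |r - c| <= atol + rtol * |r|."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))


# Hypothetical flattened outputs, e.g. obtained via array.ravel().tolist()
torch_out = [0.12345678, -1.5, 3.0]
onnx_out = [0.12345912, -1.5000021, 2.9999984]

print(max_abs_diff(torch_out, onnx_out))
print(outputs_match(torch_out, onnx_out))
```

A combined absolute and relative tolerance is used because small float32 differences between backends are expected, while a large mismatch usually points to an export problem.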

c++

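#include <cassert>
#include <cstdint>
#include <vector>
#include <onnxruntime_cxx_api.h>
// Older onnxruntime releases declare OrtSessionOptionsAppendExecutionProvider_CUDA in
// <cuda_provider_factory.h>; include that header too if the CUDA call below fails to compile.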

int main(int argc, char* argv[])
{
    // Log at VERBOSE level so the console shows whether ops run on the CPU or the GPU
    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "test");
    Ort::SessionOptions session_options;
    // Use five threads within each op to speed up execution
    session_options.SetIntraOpNumThreads(5);
    // The second argument is the GPU device_id (0 here); comment this line out to run on the CPU
    OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
    // ORT_ENABLE_ALL: enable all possible optimizations
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    
#ifdef _WIN32
    const wchar_t* model_path = L"net.onnx";
#else
    const char* model_path = "net.onnx";
#endif

    Ort::Session session(env, model_path, session_options);
    // Query the number of inputs and outputs; for this three-input, three-output network both are 3
    Ort::AllocatorWithDefaultOptions allocator;
    size_t num_input_nodes = session.GetInputCount();
    size_t num_output_nodes = session.GetOutputCount();
    
    std::vector<const char*> input_node_names(num_input_nodes);
    std::vector<const char*> output_node_names(num_output_nodes);
    std::vector<std::vector<int64_t>> input_node_dims_vector;
    std::vector<std::vector<int64_t>> output_node_dims_vector;
    std::vector<int64_t> input_node_dims_sum;
    std::vector<int64_t> output_node_dims_sum;
    int64_t input_node_dims_sum_all{ 1 };
    int64_t output_node_dims_sum_all{ 1 };
    
    // Collect information about every input node
    for (size_t i = 0; i < num_input_nodes; i++) {
        // Name of the input node (char*). Newer onnxruntime releases replace
        // GetInputName with GetInputNameAllocated.
        char* input_name = session.GetInputName(i, allocator);
        input_node_names[i] = input_name;

        Ort::TypeInfo type_info = session.GetInputTypeInfo(i);
        auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
        // Element data type of the input node
        ONNXTensorElementDataType type = tensor_info.GetElementType();

        // Shape of the input node (std::vector<int64_t>)
        std::vector<int64_t> input_node_dims = tensor_info.GetShape();
        input_node_dims_vector.emplace_back(input_node_dims);
        int64_t sums{ 1 };
        // Element count of the input node (product of its dimensions, int64_t), used later
        for (size_t j = 0; j < input_node_dims.size(); j++) {
            sums *= input_node_dims[j];
        }
        input_node_dims_sum.emplace_back(sums);
        input_node_dims_sum_all *= sums;
    }
    // Collect information about every output node
    for (size_t i = 0; i < num_output_nodes; i++) {
        // Name of the output node (char*). Newer onnxruntime releases replace
        // GetOutputName with GetOutputNameAllocated.
        char* output_name = session.GetOutputName(i, allocator);
        output_node_names[i] = output_name;

        Ort::TypeInfo type_info = session.GetOutputTypeInfo(i);
        auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
        // Element data type of the output node
        ONNXTensorElementDataType type = tensor_info.GetElementType();

        // Shape of the output node (std::vector<int64_t>)
        std::vector<int64_t> output_node_dims = tensor_info.GetShape();
        output_node_dims_vector.emplace_back(output_node_dims);
        int64_t sums{ 1 };
        // Element count of the output node (product of its dimensions, int64_t), used later
        for (size_t j = 0; j < output_node_dims.size(); j++) {
            sums *= output_node_dims[j];
        }
        output_node_dims_sum.emplace_back(sums);
        output_node_dims_sum_all *= sums;
    }
	
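    // Create the input tensors. `inputs` holds one float buffer per input node
    // (std::vector<std::vector<float>>); the buffers here are zero-filled placeholders
    // to be replaced with real data (e.g. 1*1*224*224 floats for the first input).
    // CreateTensor arguments:
    //   1st: memory info describing where the buffer lives
    //   2nd: pointer to the input data (float*)
    //   3rd: total element count of the input node (e.g. 1*1*224*224)
    //   4th: pointer to the shape data (int64_t*), e.g. {1, 1, 224, 224}
    //   last: number of dimensions (size_t), e.g. 4
    std::vector<std::vector<float>> inputs;
    for (size_t i = 0; i < num_input_nodes; i++) {
        inputs.emplace_back(static_cast<size_t>(input_node_dims_sum[i]), 0.0f);
    }
    std::vector<Ort::Value> ort_inputs;
    auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    for (size_t i = 0; i < num_input_nodes; i++) {
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(memory_info, inputs[i].data(),
            inputs[i].size(), input_node_dims_vector[i].data(), input_node_dims_vector[i].size());
        assert(input_tensor.IsTensor());
        // Ort::Value cannot be copied, so move it into the vector
        ort_inputs.emplace_back(std::move(input_tensor));
    }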
   
    // Run inference. Arguments:
    //   1st: run options
    //   2nd: array of input node names
    //   3rd: pointer to the input tensors
    //   4th: number of input nodes
    //   5th: array of output node names
    //   last: number of output nodes
    std::vector<Ort::Value> output_tensors = session.Run(Ort::RunOptions{ nullptr }, input_node_names.data(), ort_inputs.data(), num_input_nodes, output_node_names.data(), num_output_nodes);
    assert(output_tensors.size() == 3 && output_tensors[0].IsTensor() && output_tensors[1].IsTensor() && output_tensors[2].IsTensor());

    // Read the outputs
    float* output0 = output_tensors[0].GetTensorMutableData<float>();
    float* output1 = output_tensors[1].GetTensorMutableData<float>();
    float* output2 = output_tensors[2].GetTensorMutableData<float>();
}
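
One detail worth keeping in mind with the element-count loops above: when a model is exported with dynamic axes, onnxruntime reports those dimensions as -1, so a plain product of the shape becomes negative. The sketch below shows the arithmetic in Python. The function name and the `dynamic_size` parameter are hypothetical, and the shapes are the ones used in this post.

```python
from typing import Sequence


def element_count(shape: Sequence[int], dynamic_size: int = 1) -> int:
    """Product of the dimensions; a negative (dynamic) dimension is replaced by dynamic_size."""
    count = 1
    for dim in shape:
        count *= dim if dim >= 0 else dynamic_size
    return count


print(element_count([1, 1, 224, 224]))                # fixed shape
print(element_count([-1, 1, 60, 60], dynamic_size=4)) # dynamic batch dimension, assumed to be 4
```

In the C++ code the same substitution would have to happen before the buffer is allocated and before the shape is passed to CreateTensor.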

Pitfalls

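Conv4d is not supported -> rewrite Conv4d as a plain function instead of a class built on the torch.nn conv modules.
Each model input must be a single tensor; a list or tuple passed as an input is parsed as a single-variable input.
Slice assignment with a step greater than 1, such as x[::2] = 1, is not supported (although it worked when I tested with onnxruntime 1.6, so this one is uncertain).
torch.nn.functional.grid_sample is not supported -> use mmcv.ops.point_sample.bilinear_grid_sample instead.
Branching is not supported. For example, with TopK and K = 1024, an input with fewer than 1024 elements raises an error.
In the C++ API, Ort::Session and Ort::Value cannot be copied, and in my tests moving them out of a helper function with std::move did not work either, so I create and use them in the outermost function.
PyTorch uses int64 for index values, but I had to cast them to float32 before the C++ onnxruntime produced correct results; otherwise it output NaN. Any code downstream of an indexing operation therefore had to be rewritten. (The official site says int64 outputs are supported from ONNX opset 12 onward, so this may be a version issue: it failed with onnxruntime 1.5.2, and worked with opset 12 and the Python onnxruntime 1.6.0.)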
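
For the TopK pitfall, the usual workaround is to clamp K to the number of available elements before selecting. Here is a minimal sketch of that idea in plain Python; `safe_topk` is a hypothetical helper, not a PyTorch or ONNX function. Note that with a tracing-based export the clamped K may be recorded as a constant for the dummy input's size, so this illustrates the logic rather than a drop-in export fix.

```python
import heapq
from typing import List, Sequence, Tuple


def safe_topk(values: Sequence[float], k: int) -> Tuple[List[float], List[int]]:
    """Return the k largest values and their indices, clamping k to len(values)."""
    k = min(k, len(values))
    top = heapq.nlargest(k, enumerate(values), key=lambda pair: pair[1])
    indices = [index for index, _ in top]
    return [values[index] for index in indices], indices


scores = [0.1, 0.9, 0.5]
top_values, top_indices = safe_topk(scores, 1024)  # K is larger than the input
print(top_values, top_indices)
```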

Pitfalls recorded by other bloggers:

https://www.cnblogs.com/xiaxuexiaoab/p/15654972.html
https://blog.csdn.net/zzz_zzz12138/article/details/109138805
https://blog.csdn.net/JoeyChen1219/article/details/121141318

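Miscellaneous notes

Some single-input networks (such as YOLO object detectors and face recognition networks) can also be run with OpenCV's cv2.dnn.readNetFromONNX.
In practice, LibTorch is both more accurate and simpler to use. In my experiments, the outputs of the C++ code using LibTorch matched the Python outputs exactly (zero error), while inference through ONNX introduced an error of about 1e-5, so ONNX is a poor fit for tasks that need highly precise values. Since I also looked into calling PyTorch models from C++ with PyTorch + LibTorch, here is a brief note on that as well.
There are again two main steps:

In the Python environment, create a .pt model with torch.jit.trace or torch.jit.script.
In the C++ environment, load the saved .pt model with LibTorch and run inference.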

Some useful blog posts:

Official documentation
[PyTorch] The PyTorch C++ frontend and model deployment

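References

(An interesting topic) Encrypting and decrypting ONNX models
Converting ONNX models to the ORT format for mobile deployment
Dynamic inputs and outputs in ONNX
Running inference in ONNX with a model exported from PyTorch
Deploying PyTorch models in C++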
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/test/shared_lib/test_inference.cc
