[人工智能] stable-baselines3学习之自定义策略网络（Custom Policy Network）

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 人工智能 -> stable-baselines3学习之自定义策略网络（Custom Policy Network） -> 正文阅读

[人工智能]stable-baselines3学习之自定义策略网络（Custom Policy Network）

stable-baselines3学习之自定义策略网络（Custom Policy Network）

stable-baselines3为图像 (CnnPolicies)、其他类型的输入特征 (MlpPolicies) 和多个不同的输入 (MultiInputPolicies) 提供policy networks。

1.SB3 policy

SB3网络分为两个主要部分：

一个特征提取器（通常在适用时在actor和critic之间共享），作用是从高维observation中提取特征转换为特征向量，例如用CNN从图像中提取特征。使用features_extractor_class参数，通过传递features_extractor_kwargs参数可以改变特征提取器的默认参数。
一个全连接网络，映射特征到action或者value，它的网络结构由net_arch参数控制。

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XrkpoqJi-1647865227156)(C:\Users\admin\AppData\Roaming\Typora\typora-user-images\image-20220321153912964.png)]$

SB3 policies通常由多个网络（actor/critic+target network（适用时））和optimizers组成，这些网络都有一个feature extractor和一个fully-connected network。

注：在SB3中的提到的policy并不是指RL中actor对应的那个policy，而是所有训练中用到的网络的类。

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-BBTXizUt-1647865227157)(C:\Users\admin\AppData\Roaming\Typora\typora-user-images\image-20220321154458483.png)]$

2.自定义网络结构

自定义策略网络架构的一种方法是在创建模型时使用policy_kwargs传递参数：

import gym
import torch as th

from stable_baselines3 import PPO

# Custom actor (pi) and value function (vf) networks
# of two layers of size 32 each with Relu activation function
policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=[dict(pi=[32, 32], vf=[32, 32])])
# Create the agent
model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
# Retrieve the environment
env = model.get_env()
# Train the agent
model.learn(total_timesteps=100000)
# Save the agent
model.save("ppo_cartpole")

del model
# the policy_kwargs are automatically loaded
model = PPO.load("ppo_cartpole", env=env)

3.自定义特征提取器

如果你想有一个自定义的特征提取器（例如使用图像时自定义 CNN），你可以定义派生自BaseFeaturesExtractor的类，然后在训练时将其传递给模型。

注：默认情况下，特征提取器在actor和critic之间共享以节省计算（如果适用）。但是，在on-policy 算法定义自定义policy时或者在policy_kwargs中设置share_features_extractor=False的off-policy 算法时不共享。

import gym
import torch as th
import torch.nn as nn

from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """
    :param observation_space: (gym.Space)
    :param features_dim: (int) Number of features extracted.
        This corresponds to the number of unit for the last layer.
    """

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume CxHxW images (channels first)
        # Re-ordering will be done by pre-preprocessing or wrapper
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128),
)
model = PPO("CnnPolicy", "BreakoutNoFrameskip-v4", policy_kwargs=policy_kwargs, verbose=1)
model.learn(1000)

4.多个输入和字典类型观察

Stable Baselines3 支持处理多个输入使用DictGym 空间。这可以使用MultiInputPolicy来完成，默认情况下使用CombinedExtractor特征提取器将多个输入转换为单个向量，由net_arch网络处理。

默认情况下，CombinedExtractor按如下方式处理多个输入：

如果输入是图像（自动检测，请参阅common.preprocessing.is_image_space），则使用 Nature Atari CNN 网络处理图像并输出大小为的潜在向量256。
如果输入不是图像，则将其展平（无图层）。
将所有先前的向量连接成一个长向量并将其传递给策略。

与上面非常相似，您可以定义自定义特征提取器。以下示例假设环境在观察空间字典中有两个键：“image”是 (1,H,W) 图像（通道优先），“vector”是 (D,) 维向量。我们使用简单的下采样处理“图像”，使用单个线性层处理“矢量”。

import gym
import torch as th
from torch import nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomCombinedExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Dict):
        # We do not know features-dim here before going over all the items,
        # so put something dummy for now. PyTorch requires calling
        # nn.Module.__init__ before adding modules
        super(CustomCombinedExtractor, self).__init__(observation_space, features_dim=1)

        extractors = {}

        total_concat_size = 0
        # We need to know size of the output of this extractor,
        # so go over all the spaces and compute output feature sizes
        for key, subspace in observation_space.spaces.items():
            if key == "image":
                # We will just downsample one channel of the image by 4x4 and flatten.
                # Assume the image is single-channel (subspace.shape[0] == 0)
                extractors[key] = nn.Sequential(nn.MaxPool2d(4), nn.Flatten())
                total_concat_size += subspace.shape[1] // 4 * subspace.shape[2] // 4
            elif key == "vector":
                # Run through a simple MLP
                extractors[key] = nn.Linear(subspace.shape[0], 16)
                total_concat_size += 16

        self.extractors = nn.ModuleDict(extractors)

        # Update the features dim manually
        self._features_dim = total_concat_size

    def forward(self, observations) -> th.Tensor:
        encoded_tensor_list = []

        # self.extractors contain nn.Modules that do all the processing.
        for key, extractor in self.extractors.items():
            encoded_tensor_list.append(extractor(observations[key]))
        # Return a (B, self._features_dim) PyTorch tensor, where B is batch dimension.
        return th.cat(encoded_tensor_list, dim=1)

5.On-Policy Algorithms

Shared Networks

A2C and PPO policies的 net_arch 参数允许特定数量和大小的隐藏层并且有些是共享的在policy network和value network。它假定有下面结构的列表：

任意大小（允许为零）的整数个数，每个整数指定共享层中的单元数。如果整数的数量为零，则不会有共享层。
一个可选的字典，用于为价值网络和策略网络指定以下非共享层。它的格式类似于dict(vf=[<value layer sizes>], pi=[<policy layer sizes>]). 如果它缺少任何键（pi 或 vf），则假定没有非共享层（空列表）。

简而言之格式如下： [<shared layers>, dict(vf=[<non-shared value network layers>], pi=[<non-shared policy network layers>])].

举例：

（1）两个大小为128的共享层：net_arch=[128, 128]

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vEjROUYo-1647865227157)(C:\Users\admin\AppData\Roaming\Typora\typora-user-images\image-20220321190644992.png)]$

（2）比策略网络更深的价值网络，第一层共享：net_arch=[128, dict(vf=[256, 256])]

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-UlhBiU0P-1647865227158)(C:\Users\admin\AppData\Roaming\Typora\typora-user-images\image-20220321190815775.png)]$

（3）先共享然后发散：[128, dict(vf=[256], pi=[16])]

$[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-dcUMBq1E-1647865227158)(C:\Users\admin\AppData\Roaming\Typora\typora-user-images\image-20220321190932553.png)]$

更高级的示例

如果您的任务需要对actor/value架构进行更精细的控制，您可以直接重新定义策略：

from typing import Callable, Dict, List, Optional, Tuple, Type, Union

import gym
import torch as th
from torch import nn

from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy


class CustomNetwork(nn.Module):
    """
    Custom network for policy and value function.
    It receives as input the features extracted by the feature extractor.

    :param feature_dim: dimension of the features extracted with the features_extractor (e.g. features from a CNN)
    :param last_layer_dim_pi: (int) number of units for the last layer of the policy network
    :param last_layer_dim_vf: (int) number of units for the last layer of the value network
    """

    def __init__(
        self,
        feature_dim: int,
        last_layer_dim_pi: int = 64,
        last_layer_dim_vf: int = 64,
    ):
        super(CustomNetwork, self).__init__()

        # IMPORTANT:
        # Save output dimensions, used to create the distributions
        self.latent_dim_pi = last_layer_dim_pi
        self.latent_dim_vf = last_layer_dim_vf

        # Policy network
        self.policy_net = nn.Sequential(
            nn.Linear(feature_dim, last_layer_dim_pi), nn.ReLU()
        )
        # Value network
        self.value_net = nn.Sequential(
            nn.Linear(feature_dim, last_layer_dim_vf), nn.ReLU()
        )

    def forward(self, features: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        """
        :return: (th.Tensor, th.Tensor) latent_policy, latent_value of the specified network.
            If all layers are shared, then ``latent_policy == latent_value``
        """
        return self.policy_net(features), self.value_net(features)

    def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

    def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)


class CustomActorCriticPolicy(ActorCriticPolicy):
    def __init__(
        self,
        observation_space: gym.spaces.Space,
        action_space: gym.spaces.Space,
        lr_schedule: Callable[[float], float],
        net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None,
        activation_fn: Type[nn.Module] = nn.Tanh,
        *args,
        **kwargs,
    ):

        super(CustomActorCriticPolicy, self).__init__(
            observation_space,
            action_space,
            lr_schedule,
            net_arch,
            activation_fn,
            # Pass remaining arguments to base class
            *args,
            **kwargs,
        )
        # Disable orthogonal initialization
        self.ortho_init = False

    def _build_mlp_extractor(self) -> None:
        self.mlp_extractor = CustomNetwork(self.features_dim)


model = PPO(CustomActorCriticPolicy, "CartPole-v1", verbose=1)
model.learn(5000)

6.Off-Policy Algorithms

如果你需要一个网络架构他相比于SAC，DDPG或者TD3有不同actor/critic结构，可以用以下结构的字典结构dict(qf=[<critic network architecture>], pi=[<actor network architecture>])

比如你想要一个不同架构的actor（pi）和critic（qf）网络，你可以net_arch=dict(qf=[400, 300], pi=[64, 64]).

或者你的actor和critic共享相同的网络结构，你可以net_arch=[256, 256]（两个隐藏层每个有256个单元）

from stable_baselines3 import SAC

# Custom actor architecture with two layers of 64 units each
# Custom critic architecture with two layers of 400 and 300 units
policy_kwargs = dict(net_arch=dict(pi=[64, 64], qf=[400, 300]))
# Create the agent
model = SAC("MlpPolicy", "Pendulum-v1", policy_kwargs=policy_kwargs, verbose=1)
model.learn(5000)