[Python知识库] gym库中core.py代码解读

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> Python知识库 -> gym库中core.py代码解读 -> 正文阅读

[Python知识库]gym库中core.py代码解读

代码地址：gym/core.py at master · openai/gym · GitHubA toolkit for developing and comparing reinforcement learning algorithms. - gym/core.py at master · openai/gymhttps://github.com/openai/gym/blob/master/gym/core.py

一导入相关模块

"""Core API for Environment, Wrapper, ActionWrapper, RewardWrapper and ObservationWrapper."""
import sys
from typing import (
    TYPE_CHECKING,
    Any,
    Dict,
    Generic,
    List,
    Optional,
    SupportsFloat,
    Tuple,
    TypeVar,
    Union,
)

import numpy as np

from gym import spaces
from gym.logger import deprecation, warn
from gym.utils import seeding

sys文件提供了一系列有关python运行环境的变量和函数的模块，如sys.argv函数实现从程序外部向程序传递参数；sys.platform函数用于获取当前系统平台。?

gym库在 Linux 和 macOS 上支持 Python 3.7、3.8、3.9 和 3.10。在Windows上并不正式支持。python版本最低为3.6。

if sys.version_info[0:2] == (3, 6):
    warn(
        "Gym minimally supports python 3.6 as the python foundation not longer supports the version, please update your version to 3.7+"
    )

“Gym 最低支持python 3.6，因为python 不再支持该版本，请将您的版本更新到3.7+”?

ObsType = TypeVar("ObsType")
ActType = TypeVar("ActType")
RenderFrame = TypeVar("RenderFrame")

python中的范类型，还没学~

二 Env类

class Env(Generic[ObsType, ActType]):
    r"""The main OpenAI Gym class.

    It encapsulates an environment with arbitrary behind-the-scenes dynamics.
    An environment can be partially or fully observed.

    The main API methods that users of this class need to know are:

    - :meth:`step` - Takes a step in the environment using an action returning the next observation, reward,
      if the environment terminated and observation information.
    - :meth:`reset` - Resets the environment to an initial state, returning the initial observation and observation information.
    - :meth:`render` - Renders the environment observation with modes depending on the output
    - :meth:`close` - Closes the environment, important for rendering where pygame is imported

    And set the following attributes:

    - :attr:`action_space` - The Space object corresponding to valid actions
    - :attr:`observation_space` - The Space object corresponding to valid observations
    - :attr:`reward_range` - A tuple corresponding to the minimum and maximum possible rewards
    - :attr:`spec` - An environment spec that contains the information used to initialise the environment from `gym.make`
    - :attr:`metadata` - The metadata of the environment, i.e. render modes
    - :attr:`np_random` - The random number generator for the environment

    Note: a default reward range set to :math:`(-\infty,+\infty)` already exists. Set it if you want a narrower range.
    """

    # Set this in SOME subclasses
    metadata: Dict[str, Any] = {"render_modes": []}
    # define render_mode if your environment supports rendering
    render_mode: Optional[str] = None
    reward_range = (-float("inf"), float("inf"))
    spec: "EnvSpec" = None

    # Set these in ALL subclasses
    action_space: spaces.Space[ActType]
    observation_space: spaces.Space[ObsType]

    # Created
    _np_random: Optional[np.random.Generator] = None

    @property
    def np_random(self) -> np.random.Generator:
        """Returns the environment's internal :attr:`_np_random` that if not set will initialise with a random seed."""
        if self._np_random is None:
            self._np_random, seed = seeding.np_random()
        return self._np_random

    @np_random.setter
    def np_random(self, value: np.random.Generator):
        self._np_random = value

    def step(
        self, action: ActType
    ) -> Union[
        Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict]
    ]:
        """Run one timestep of the environment's dynamics.

        When end of episode is reached, you are responsible for calling :meth:`reset` to reset this environment's state.
        Accepts an action and returns either a tuple `(observation, reward, terminated, truncated, info)`, or a tuple
        (observation, reward, done, info). The latter is deprecated and will be removed in future versions.

        Args:
            action (ActType): an action provided by the agent

        Returns:
            observation (object): this will be an element of the environment's :attr:`observation_space`.
                This may, for instance, be a numpy array containing the positions and velocities of certain objects.
            reward (float): The amount of reward returned as a result of taking the action.
            terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached.
                In this case further step() calls could return undefined results.
            truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied.
                Typically a timelimit, but could also be used to indicate agent physically going out of bounds.
                Can be used to end the episode prematurely before a `terminal state` is reached.
            info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging).
                This might, for instance, contain: metrics that describe the agent's performance state, variables that are
                hidden from observations, or individual reward terms that are combined to produce the total reward.
                It also can contain information that distinguishes truncation and termination, however this is deprecated in favour
                of returning two booleans, and will be removed in a future version.

            (deprecated)
            done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results.
                A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully,
                a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
        """
        raise NotImplementedError

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict] = None,
    ) -> Tuple[ObsType, dict]:
        """Resets the environment to an initial state and returns the initial observation.

        This method can reset the environment's random number generator(s) if ``seed`` is an integer or
        if the environment has not yet initialized a random number generator.
        If the environment already has a random number generator and :meth:`reset` is called with ``seed=None``,
        the RNG should not be reset. Moreover, :meth:`reset` should (in the typical use case) be called with an
        integer seed right after initialization and then never again.

        Args:
            seed (optional int): The seed that is used to initialize the environment's PRNG.
                If the environment does not already have a PRNG and ``seed=None`` (the default option) is passed,
                a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom).
                However, if the environment already has a PRNG and ``seed=None`` is passed, the PRNG will *not* be reset.
                If you pass an integer, the PRNG will be reset even if it already exists.
                Usually, you want to pass an integer *right after the environment has been initialized and then never again*.
                Please refer to the minimal example above to see this paradigm in action.
            options (optional dict): Additional information to specify how the environment is reset (optional,
                depending on the specific environment)


        Returns:
            observation (object): Observation of the initial state. This will be an element of :attr:`observation_space`
                (typically a numpy array) and is analogous to the observation returned by :meth:`step`.
            info (dictionary):  This dictionary contains auxiliary information complementing ``observation``. It should be analogous to
                the ``info`` returned by :meth:`step`.
        """
        # Initialize the RNG if the seed is manually passed
        if seed is not None:
            self._np_random, seed = seeding.np_random(seed)

    def render(self) -> Optional[Union[RenderFrame, List[RenderFrame]]]:
        """Compute the render frames as specified by render_mode attribute during initialization of the environment.

        The set of supported modes varies per environment. (And some
        third-party environments may not support rendering at all.)
        By convention, if render_mode is:

        - None (default): no render is computed.
        - human: render return None.
          The environment is continuously rendered in the current display or terminal. Usually for human consumption.
        - single_rgb_array: return a single frame representing the current state of the environment.
          A frame is a numpy.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
        - rgb_array: return a list of frames representing the states of the environment since the last reset.
          Each frame is a numpy.ndarray with shape (x, y, 3), as with single_rgb_array.
        - ansi: Return a list of strings (str) or StringIO.StringIO containing a
          terminal-style text representation for each time step.
          The text can include newlines and ANSI escape sequences (e.g. for colors).

        Note:
            Make sure that your class's metadata 'render_modes' key includes
            the list of supported modes. It's recommended to call super()
            in implementations to use the functionality of this method.
        """
        raise NotImplementedError

    def close(self):
        """Override close in your subclass to perform any necessary cleanup.

        Environments will automatically :meth:`close()` themselves when
        garbage collected or when the program exits.
        """
        pass

    @property
    def unwrapped(self) -> "Env":
        """Returns the base non-wrapped environment.

        Returns:
            Env: The base non-wrapped gym.Env instance
        """
        return self

    def __str__(self):
        """Returns a string of the environment with the spec id if specified."""
        if self.spec is None:
            return f"<{type(self).__name__} instance>"
        else:
            return f"<{type(self).__name__}<{self.spec.id}>>"

    def __enter__(self):
        """Support with-statement for the environment."""
        return self

    def __exit__(self, *args):
        """Support with-statement for the environment."""
        self.close()
        # propagate exception
        return False

Env类封装了一个具有任意幕后动态的环境。可以部分或完全观察环境。

主要的API有：

- ：meth：`step`-使用返回下一个观察，奖励的动作在环境中迈出一步。

如果环境终止和观察信息。
- ：meth：`reset' - 将环境重置为初始状态，返回初始观察和观察信息。
- ：meth：`render` - 根据输出而用模式呈现环境观察。
- ：meth：`close` - 关闭环境。

设置了以下属性：

?- ：attr：`action_space`-与有效动作相对应的空间对象
?- ：attr：`observation_space`-与有效观测值相对应的空间对象
?- ：attr：`ready_range`-元组对应于最低和最大可能的奖励
?- ：attr：`spec`-一个环境规范，其中包含用于初始化环境的信息``gym.make''
?- ：attr：`metadata`-环境的元数据，即渲染模式
?- ：attr：`np_random`-环境的随机数生成器

注：默认奖励范围设置为：math：`（ -? \ infty，+\ infty）`已经存在。如果您想要较窄的范围，请设置它。

需要在一些子类中设置的：

    # Set this in SOME subclasses
    metadata: Dict[str, Any] = {"render_modes": []}

如果您的环境支持渲染，请定义 render_mode ：

    # define render_mode if your environment supports rendering
    render_mode: Optional[str] = None
    reward_range = (-float("inf"), float("inf"))
    spec: "EnvSpec" = None

在所有子类中设置：

    # Set these in ALL subclasses
    action_space: spaces.Space[ActType]
    observation_space: spaces.Space[ObsType]

 @property
    def np_random(self) -> np.random.Generator:
        """Returns the environment's internal :attr:`_np_random` that if not set will initialise with a random seed."""
        if self._np_random is None:
            self._np_random, seed = seeding.np_random()
        return self._np_random

    @np_random.setter
    def np_random(self, value: np.random.Generator):
        self._np_random = value

np_random函数：返回环境的内部:attr:`_np_random`，如果未设置，将使用随机种子进行初始化。?

step

 def step(
        self, action: ActType
    ) -> Union[
        Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict]
    ]:
        

        raise NotImplementedError

运行环境动态的一个时间步长。

当一个episode结束时，你负责调用:meth:`reset` 来重置这个环境的状态。接受一个动作并返回一个元组（observation, reward, terminated, truncated, info）\（观察、奖励、终止、截断、信息）或一个元组(observation, reward, done, info)\(观察、奖励、完成、信息）。后者已被弃用，将在未来的版本中删除。?

参数：

action (ActType):agent提供的动作

返回值：

observation (object):这将是环境的:attr:`observation_space` 的一个元素。例如，这可能是一个包含某些对象的位置和速度的 numpy 数组。
reward (float):采取行动后返回的奖励数值。
terminated (bool):是否达到“终端状态”（根据任务的 MDP 定义）。在这种情况下，进一步的 step() 调用可能会返回未定义的结果。
truncated (bool): 是否满足 MDP 范围之外的截断条件。通常是一个时间限制，但也可用于指示代理在物理上超出范围。可用于在达到“终端状态”之前提前结束episode。
info (dictionary):`info` 包含辅助诊断信息（有助于调试、学习和记录）。例如，这可能包含：描述agent性能状态的指标、变量隐藏在观察或组合产生总奖励的个人奖励条款中。它还可以包含区分截断和终止的信息，但是不推荐这样做返回两个布尔值，并将在未来的版本中删除。

set

def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict] = None,
    ) -> Tuple[ObsType, dict]:
        
        # Initialize the RNG if the seed is manually passed
        if seed is not None:
            self._np_random, seed = seeding.np_random(seed)

将环境重置为初始状态并返回初始观察值。

如果`seed`是一个整数，或者如果环境还没有初始化随机数发生器，这个方法可以重置环境的随机数发生器。

如果环境已经有一个随机数生成器，并且在调用 :meth:`reset`时，`seed=None',RNG不应该被重置。此外， :meth:`reset`应该（在典型的使用情况下）在初始化后立即用一个整数种子来调用。在初始化后立即调用整数种子，然后不再调用。

参数：

seed (optional int):用于初始化环境的PRNG的种子。如果环境还没有一个PRNG，并且传递了``seed=None``（默认选项）。将从某种熵源中选择一个种子（例如，时间戳或/dev/urandom）。然而，如果环境中已经有一个PRNG，并且``seed=None''被传递，PRNG将不会被重置。如果你传递一个整数，即使PRNG已经存在，它也会被重置。通常情况下，你希望在环境被初始化后*直接传递一个整数，然后不再传递*。请参考上面的最小例子，看看这个范式的作用。
options (optional dict):用于指定环境重置方式的附加信息（可选，取决于具体的环境）。

返回值：

observation (object):对初始状态的观察。这将是 :attr:`observation_space` 的一个元素（通常是一个数组），与 :meth:`step` 返回的观察值类似的元素（通常是一个numpy数组），类似于`step`返回的观察值。
info (dictionary):该字典包含补充 "观察 "的辅助信息。它应该是类似于由:meth:`step'返回的`信息'。

render

    def render(self) -> Optional[Union[RenderFrame, List[RenderFrame]]]:
        """Compute the render frames as specified by render_mode attribute during initialization of the environment.
        The set of supported modes varies per environment. (And some
        third-party environments may not support rendering at all.)
        By convention, if render_mode is:
        - None (default): no render is computed.
        - human: render return None.
          The environment is continuously rendered in the current display or terminal. Usually for human consumption.
        - single_rgb_array: return a single frame representing the current state of the environment.
          A frame is a numpy.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
        - rgb_array: return a list of frames representing the states of the environment since the last reset.
          Each frame is a numpy.ndarray with shape (x, y, 3), as with single_rgb_array.
        - ansi: Return a list of strings (str) or StringIO.StringIO containing a
          terminal-style text representation for each time step.
          The text can include newlines and ANSI escape sequences (e.g. for colors).
        Note:
            Make sure that your class's metadata 'render_modes' key includes
            the list of supported modes. It's recommended to call super()
            in implementations to use the functionality of this method.
        """
        raise NotImplementedError

在环境的初始化过程中，按照render_mode属性的规定计算渲染帧。

支持的模式集因环境而异。(而有些第三方环境可能根本就不支持渲染）。按照惯例，如果render_mode是:?

None (default):不计算渲染。
human:渲染返回无。环境在当前的显示器或终端中被连续呈现。通常是供人查看。
single_rgb_array:返回一个代表环境当前状态的单一框架。帧是一个numpy.ndarray，其形状为(x, y, 3)，代表一个x-y像素图像的RGB值。
rgb_array:返回一个代表上次重置后环境状态的帧的列表。每个帧都是一个numpy.ndarray，形状为(x, y, 3)，与single_rgb_array一样。
ansi:返回一个字符串（str）或StringIO.StringIO的列表，其中包含每个时间步长的终端式文本表示。终端风格的文本表示的每个时间步骤。该文本可以包括换行符和ANSI转义序列（例如，颜色）。

注意：
请确保你的类的元数据 "render_modes "键包括支持的模式列表。建议在实现中调用super()来使用这个方法的功能。?

close

def close(self):
        """Override close in your subclass to perform any necessary cleanup.
        Environments will automatically :meth:`close()` themselves when
        garbage collected or when the program exits.
        """
        pass

在子类中覆盖close，以执行任何必要的清理。当垃圾回收或程序退出时，环境将自动 close()。

装饰器的部分之后再写~