
Traffic light control using deep policy-gradient and value-function-based reinforcement learning

ISSN 1751-956X

Authors: Seyed Sajad Mousavi, Michael Schukat, Enda Howley



Abstract:

Recent advances in combining deep neural network architectures with reinforcement learning (RL) techniques have shown promising results in solving complex control problems with high-dimensional state and action spaces. Inspired by these successes, in this study, the authors built two kinds of RL algorithms: deep policy-gradient (PG) and value-function-based agents, which can predict the best possible traffic signal for a traffic intersection. At each time step, these adaptive traffic light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals. The PG-based agent maps its observation directly to the control signal; however, the value-function-based agent first estimates values for all legal control signals. The agent then selects the optimal control action with the highest value. Their methods show promising results in a traffic network simulated in the simulation of urban mobility (SUMO) traffic simulator, without suffering from instability issues during the training process.


1 Introduction

With regard to the fast growing population around the world, the urban population in the 21st century is expected to increase dramatically. Hence, it is imperative that urban infrastructure is managed effectively to contend with this growth. One of the most critical considerations when designing modern cities is developing smart traffic management systems. The main goal of a traffic management system is reducing traffic congestion, which nowadays is one of the major issues of megacities. Efficient urban traffic management results in time and financial savings as well as reducing carbon dioxide emissions into the atmosphere. To address this issue, a lot of solutions have been proposed [1-4]. They can be roughly classified into three groups. The first is pre-timed signal control, where a fixed time is determined for all green phases according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated signal control, where traffic demand information, provided by inductive loop detectors at an equipped intersection, is used to decide how to control the signals, e.g. extending or terminating a green phase. The third is adaptive signal control, where the signal timing control is managed and updated automatically according to the current state of the intersection (i.e. traffic demand, queue length of vehicles in each lane of the intersection and traffic flow fluctuation) [5]. In this paper, we are interested in the third approach and aim to propose two novel methods for traffic signal control by leveraging recent advances in the machine learning and artificial intelligence fields [6, 7].


Reinforcement learning (RL) [8] as a machine learning technique for the traffic signal control problem has led to impressive results [2, 9] and has shown promise as a potential solver. It does not need perfect knowledge of the environment in advance, for example, the traffic flow. Instead, RL agents are able to gain knowledge and model the dynamics of the environment just by interacting with it. An RL agent learns by trial and error: it receives a scalar reward after taking each action in the environment. The obtained reward reflects how good the taken action was, and the agent's goal is to learn an optimal control policy, so that the discounted cumulative reward is maximised via repeated interaction with its environment. Aside from traffic control, RL has been applied to a number of real-world problems such as cloud computing [10, 11].


Typically, the complexity of using RL in real-world applications such as traffic signal management grows exponentially as the state and action spaces increase. To deal with this problem, function approximation techniques and hierarchical RL (HRL) approaches can be used. Recently, deep learning has gained huge attraction and has been successfully combined with RL techniques to deal with complex optimisation problems such as playing Atari 2600 games [7], the computer Go program [12] etc., where classical RL methods could not provide optimal solutions. In this way, the current state of the environment is fed into a deep neural net [e.g. a convolutional neural network (CNN) [13]] trained by RL techniques to predict the next possible optimal action(s).
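
As a concrete illustration of this setup, the sketch below defines a small convolutional network in Python/PyTorch that maps a stack of simulator snapshots to one score per possible action; the scores can be read as Q values (value-function-based agent) or as policy logits (PG agent). The input size of four stacked 84x84 frames and the two-action output are assumptions made here for illustration; the paper does not specify this exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StateToActionNet(nn.Module):
        """Maps stacked intersection snapshots to one score per signal phase.

        Assumed input: 4 stacked 84x84 grayscale frames; assumed 2 actions.
        The outputs are Q values for the value-function-based agent or
        logits of the policy distribution for the PG agent.
        """
        def __init__(self, in_frames=4, num_actions=2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_frames, 16, kernel_size=8, stride=4)  # 84 -> 20
            self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)         # 20 -> 9
            self.fc1 = nn.Linear(32 * 9 * 9, 256)
            self.out = nn.Linear(256, num_actions)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            x = F.relu(self.fc1(x.flatten(start_dim=1)))
            return self.out(x)  # one score per action

    # Example: scores for a single (batched) observation of zeros.
    scores = StateToActionNet()(torch.zeros(1, 4, 84, 84))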


Inspired by the successes of combining RL with the deep learning paradigm, and with regard to the complex nature of the environment of the traffic signal control problem, in this paper we aim to use the effectiveness and power of deep RL to build adaptive signal control methods in order to optimise the traffic flow. Although a few previous studies have tried to apply deep RL to the traffic signal control problem [14, 15], in this research the state representation is different. Also, one of our methods uses the policy-gradient (PG) method, which does not suffer from oscillations and instabilities during the training process and can take full advantage of the available data of the environment to develop the optimal control policy. We propose adaptive signal controllers built from a combination of two RL approaches (i.e. PG and action-value function) and a deep convolutional neural network, which perceive embedded camera observations in order to produce control signals at an isolated intersection. We conduct simulated experiments with our proposed methods in the simulation of urban mobility (SUMO) traffic simulator.


The rest of this paper is organised as follows. Section 2 provides related work in the area of traffic light control (TLC). Section 3 gives a brief review of the RL techniques which we have used in this research. Section 4 presents how to formulate the TLC problem as an RL task and the proposed methods to solve the task. Then, Section 5 provides simulation results and the performance of the proposed approaches. Finally, Section 6 concludes this paper and gives some directions for future research.



2 Related work

A lot of research has been done in the academic and industry communities to build adaptive traffic signal control systems. In particular, significant research has been conducted employing RL methods in the area of traffic light signal control [16-20]. These works have achieved promising results. However, their simulation testbeds have not been mature enough to be comparable with more realistic situations. The development of advanced traffic simulation tools has let researchers develop novel state representations and reward functions for RL algorithms, which could consider more aspects of the complexity and reality of real-world traffic problems [3, 5, 21-24]. All these attempts viewed the TLC problem as a fully observable Markov decision process (MDP) and investigated whether the Q-learning algorithm can be applied to it. However, Richter's study formulated the traffic problem as a partially observable MDP (POMDP) and applied PG methods to guarantee local convergence under a partially observable environment [25].


By utilising advances in deep learning and its application to different domains [11, 26, 27], deep learning has gained attention in the area of traffic management systems. Previous research has used deep stacked auto-encoder (SAE) neural networks to estimate Q values, where each Q value corresponds to an available signal phase [28]. It considered measures of speed and queueing length as its state in each time step of the learning process of its proposed method. Two recent studies by van der Pol and Oliehoek [14] and Genders and Razavi [15] provided deep RL agents that used a deep Q-network [7] to map from given states to Q values. Their state representations were a binary matrix of the positions of vehicles on the lanes of an intersection, and a combination of the presence matrix of vehicles, speed and the current traffic signal phase, respectively. However, we use raw visual input data of the traffic simulator snapshots as system states. Moreover, in addition to estimating the Q-function, one of the proposed methods directly maps from the input state to a probability distribution over actions (i.e. signal phases) via the deep PG method.



3 Background

In this section, we will review RL approaches and briefly describe how RL is applied to real-world problems where the number of states and actions is extremely high, so that regular RL techniques cannot deal with them.

3.1 Reinforcement learning

A common RL [8] setting is shown in Fig. 1, where an RL agent interacts with an environment. The interaction continues until a terminal state is reached or the agent meets a termination condition. Usually, the problems that RL techniques are applied to are treated as MDPs. An MDP is defined as a five-tuple (S, A, T, R, γ), where S is the set of states in the state space of the environment, A is the set of actions in the action space that the agent can use in order to interact with the environment, T is the transition function, which is the probability of moving between the environment states, R is the reward function and γ ∈ [0, 1] is known as the discount factor, which models the relative importance of future and immediate rewards.
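
A minimal Python sketch of this interaction loop may make the notation concrete. The Gym-style env.reset()/env.step() interface and the agent object are assumptions used only for illustration.

    # Sketch of one episode of agent-environment interaction in an MDP.
    # `env` is assumed to expose reset()/step(action); `agent` is assumed
    # to expose select_action()/update(). Neither comes from the paper.
    def run_episode(env, agent, gamma=0.99):
        state = env.reset()                                  # initial state s_0
        done = False
        discounted_return, discount = 0.0, 1.0
        while not done:
            action = agent.select_action(state)              # a_t ~ pi(.|s_t)
            next_state, reward, done = env.step(action)      # transition T, reward R
            agent.update(state, action, reward, next_state, done)
            discounted_return += discount * reward           # accumulate gamma^k * r
            discount *= gamma
            state = next_state
        return discounted_return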

At each time step t, the agent perceives the state st ∈ S and, based on its observation, selects an action at. Taking the action leads the state of the environment to transition to the next state st+1 ∈ S according to the transition function T. Then, the agent receives the reward rt, which is determined by the reward function R. The goal of the learning agent in the RL framework is to learn an optimal policy π: S × A → [0, 1], which defines the probability of selecting action at in state st, so that by following the underlying policy the expected cumulative discounted reward over time is maximised.

The discounted future reward, Rt, at time t is defined as follows:
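
In standard notation, and consistent with the role of γ described next, this definition is presumably

    R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}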

where the role of the discount factor γ is to trade off the worth of immediate and future rewards.

In most real-world problems, there are many states and actions, which makes it impossible to apply classic RL techniques that rely on tabular representations of their state and action spaces. Hence, it is common to use function approximators [29] or decomposition and aggregation techniques such as HRL approaches [30-32] and advanced HRL [33]. Different forms of function approximators can be used with RL techniques: for example, linear function approximation, a linear combination of features of the state and action spaces f and learned weights w (e.g. ∑_i f_i w_i), or non-linear function approximation (e.g. a neural network). Until recently, the majority of work in RL applied linear function approximators. More recently, deep neural networks (DNNs) such as CNNs, recurrent neural networks, SAEs etc. have also been commonly used as function approximators for large RL tasks [6, 34]. The interested reader is referred to [35] for a review of using DNNs within the RL framework.


3.2 Deep learning and deep Q-learning


3.3 PG methods



4 System description

In this section, we will formulate the TLC problem as an RL task by describing the states, actions and reward function. We then present the policy as a DNN and describe how to train the network.


4.1 State representation

We represent the state of the system as an image st ∈ R^d, a snapshot of the current state of a graphical simulator {e.g. the SUMO graphical user interface (GUI) [42]}, which is a vector of the raw pixel values of the current view of the intersection at each step of the simulation (as shown in Fig. 1). This kind of representation is like putting a camera on an intersection which enables it to view the whole intersection. The state representation in the TLC literature usually uses a vector representing the presence of a vehicle at the intersection, a Boolean-valued vector where a value of 1 indicates the presence and a value of 0 the absence of a vehicle [14, 43], or a combination of the presence vector with another vector indicating the vehicles' speed at the given intersection [15]. Apart from relying on prior knowledge being provided, these state representations make assumptions which are not generalisable to the real world. However, by feeding the state as an image to a CNN, the system can detect the location and the presence of all vehicles with different lengths and, as a result, the vehicles' queue on each lane. Furthermore, by stacking a history of consecutive observations as input, the convolutional layers of a deep network are able to estimate the velocity and travel direction of vehicles. Hence, the system can implicitly benefit from this information as well.
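
A minimal sketch of such an input pipeline, assuming a hypothetical grab_snapshot() helper that returns one grayscale view of the intersection and an arbitrary stack depth of four frames:

    from collections import deque
    import numpy as np

    K = 4                                    # number of stacked frames (assumed)
    frames = deque(maxlen=K)

    def observe(grab_snapshot):
        """grab_snapshot() is a hypothetical callable returning an (H, W) array."""
        frame = grab_snapshot().astype(np.float32) / 255.0   # normalise raw pixels
        frames.append(frame)
        while len(frames) < K:               # pad with copies on the first call
            frames.append(frame)
        return np.stack(frames, axis=0)      # observation of shape (K, H, W)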


4.2 Action set

To control the traffic signal phases, we define a set of possible actions A = {North/South green (NSG), East/West green (EWG)}. NSG allows vehicles to pass from North to South and vice versa, and also indicates that the vehicles on the East/West route should stop and not proceed through the intersection. EWG allows vehicles to pass from East to West and vice versa, and implies that the vehicles on the North/South route should stop and not proceed through the intersection. At each time step t, the agent, according to its strategy, chooses an action at ∈ A. Depending on the selected action, the vehicles on each lane are allowed to cross the intersection.


4.3 Reward function

Typically, an immediate reward rt ∈ R is a scalar value which the agent receives after taking the chosen action in the environment at each time step.

We set the reward as the difference between the total cumulative delays of two consecutive actions, i.e.
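
Consistent with the sign convention explained below (positive rewards correspond to a reduction in delay), the reward is presumably

    r_t = D_{t-1} - D_t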

where Dt and Dt-1 are the total cumulative delays at the current and previous time steps.

The total cumulative delay at time t is the summation of the cumulative delays of all the vehicles that have appeared in the system from t = 0 to the current time step t. Positive reward values imply that the taken actions led to a decrease in the total cumulative delay, and negative rewards imply an increase in the delay. With regard to the reward values, the agent may decide to change its policy in certain states of the system in the future.


4.4 Agent's policy

The agent chooses its actions based on a policy π. In the policy-based algorithm, the policy is defined as a mapping from the input state to a probability distribution over the actions A. We use a DNN as the function approximator and refer to its parameters θ as the policy parameters. The policy distribution π(at | st; θ) is learned by performing gradient descent on the policy parameters. The action-value function maps the input state to action values, each of which represents the future reward that can be achieved from the given state and action. The optimal policy can then be extracted by following a greedy approach that selects the action with the highest value.
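
The contrast between the two selection schemes can be shown in a few lines of Python; the numeric values are made-up examples, not results from the paper.

    import numpy as np

    # Value-function-based agent: pick the phase with the highest action value.
    q_values = np.array([1.7, 0.4])            # Q(s, NSG), Q(s, EWG), example numbers
    greedy_action = int(np.argmax(q_values))   # 0 -> NSG, 1 -> EWG

    # PG agent: the network outputs a distribution over phases; sample from it.
    probs = np.array([0.8, 0.2])               # pi(NSG|s), pi(EWG|s), example numbers
    sampled_action = int(np.random.choice(len(probs), p=probs))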


4.5 Objective function and system training

There are many measures in the traffic signal management literature, such as maximising throughput, minimising and balancing queue length, minimising the delay etc., that can be considered as the learning agent's objective function. In this research, the agent aims to maximise the reduction in the total cumulative delay, which empirically has been shown to maximise throughput and to reduce queue length (more details are discussed in Section 5.3).


The objective of the agent is to maximise the expected cumulative discounted reward. We aim to maximise the reward under the probability distribution π(at | st; θ).
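
Spelled out, this objective, which the text later refers to as (3), is presumably the expected discounted return under the parameterised policy:

    J(\theta) = \mathbb{E}_{\pi(a_t \mid s_t;\,\theta)}\!\left[ R_t \right]
              = \mathbb{E}_{\pi(a_t \mid s_t;\,\theta)}\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right]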

We divide the system training based on two RL approaches: value-function-based and policy-based.

In the value-function-based approach, the value function Qπ(s, a) is defined as follows:
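
Given that s, s′ and a are said to be implicit in the next sentence, this is presumably the Bellman-style recursion used by Q-learning:

    Q^{\pi}(s, a) = \mathbb{E}\!\left[ r + \gamma \max_{a'} Q^{\pi}(s', a') \,\middle|\, s, a \right]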

where it is implicit that s, s′ ∈ S and a ∈ A. The value function can be parameterised as Q(s, a; θ) with parameter vector θ.

Typically, gradient-descent methods are used to learn the parameters θ by trying to minimise the following loss function of the mean-squared error in Q values, where r + γ max_{a′} Q(s′, a′; θ) is the target value. In the DQN algorithm, a target Q-network is used to address the instability problem of the policy.
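
The loss referred to as (5) is presumably the usual mean-squared temporal-difference error,

    L_i(\theta_i) = \mathbb{E}\!\left[ \big( y_i - Q(s, a; \theta_i) \big)^2 \right],
    \qquad y_i = r + \gamma \max_{a'} Q(s', a'; \theta)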

The network is trained with the target Q-network to obtain consistent Q-learning targets by keeping the weight parameters θ⁻ used in the Q-learning target fixed and updating them periodically, every N steps, from the parameters of the main network θ.

The target value of the DQN is represented as follows:
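
With θ⁻ denoting the target-network parameters described below, the DQN target is presumably

    y_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-})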

where θ⁻ are the parameters of the target network. The stochastic gradient-descent method is used in order to optimise (5).

The parameters of the deep Q-learning algorithm are updated as follows:
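
This is presumably the standard stochastic gradient-descent step on the squared error above:

    \theta_{i+1} = \theta_i + \alpha \big( y_i - Q(s, a; \theta_i) \big) \nabla_{\theta_i} Q(s, a; \theta_i)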

where yi is the target value for iteration i and α is a scalar learning rate.

Algorithm 1 (see Fig. 2) presents the pseudo-code for the training algorithm.
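
Algorithm 1 itself appears in Fig. 2 of the paper; what follows is only a minimal Python sketch of deep Q-learning with a periodically synchronised target network, in the spirit of the description above. The env interface (a hypothetical SUMO wrapper with reset()/step()), the epsilon-greedy exploration and all hyperparameter values are assumptions for illustration; q_net can be any module mapping a state tensor to one value per action (e.g. the CNN sketched earlier).

    import copy
    import torch
    import torch.optim as optim

    def train_dqn(env, q_net, num_episodes=1000, gamma=0.99, lr=1e-4,
                  target_sync_every=500, epsilon=0.1):
        target_net = copy.deepcopy(q_net)            # frozen target Q-network
        optimizer = optim.RMSprop(q_net.parameters(), lr=lr)
        step_count = 0
        for _ in range(num_episodes):
            state = env.reset()                      # state: tensor (frames, H, W), assumed
            done = False
            while not done:
                # epsilon-greedy choice between the two signal phases (assumed)
                if torch.rand(1).item() < epsilon:
                    action = torch.randint(0, 2, (1,)).item()
                else:
                    action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
                next_state, reward, done = env.step(action)

                # Q-learning target computed with the frozen target network
                with torch.no_grad():
                    bootstrap = 0.0 if done else \
                        gamma * target_net(next_state.unsqueeze(0)).max().item()
                    target = reward + bootstrap

                q_value = q_net(state.unsqueeze(0))[0, action]
                loss = (q_value - target) ** 2       # squared TD error, as in (5)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                state = next_state
                step_count += 1
                if step_count % target_sync_every == 0:
                    target_net.load_state_dict(q_net.state_dict())  # periodic sync
        return q_net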

In the policy-based approach, the gradient of the objective function represented in (3) is given by:
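
The learning rule referred to as (8) is presumably the REINFORCE gradient estimator:

    \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi(a_t \mid s_t;\,\theta)}\!\left[ \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, R_t \right]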

This (8) is the standard learning rule of the REINFORCE algorithm [44]. It updates the policy parameters θ in the direction ∇θ log π(at | st; θ), so that the probability of action at in state st is increased if it has led to a high cumulative reward; however, it is decreased if the action has resulted in a low reward.
