1. Purpose

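In training mode, a dropout layer sets each neuron output (activation) of a fully connected layer to zero with a given probability p. It is the activations that are zeroed, not the weights. Because a different random subset of neurons is dropped on every batch, training can loosely be viewed as training many different sub-networks that share weights. The result is more robust weights and an effect similar to model ensembling, which improves generalization and reduces overfitting.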

1.2 Formula

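$h'=\left\{ \begin{aligned} 0&&\text{with probability } p\\ \frac{h}{1-p}& & \text{otherwise} \\ \end{aligned} \right.$
Note: the formula above shows that dropout takes two steps:
(1) set each activation in the current layer to 0 with probability p
(2) multiply the remaining activations by $1/(1-p)$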

Why multiply the remaining values by $1/(1-p)$? To keep the expected value of each activation unchanged:
$E(h')=0\times p+\frac{h}{1-p}\times (1-p)=h$
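
The expectation argument above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `inverted_dropout` and the values of `h`, `p` and `n` are assumptions chosen for illustration, not part of the original post.

```python
import random


def inverted_dropout(h, p, rng):
    """Return 0 with probability p, otherwise h / (1 - p)."""
    return 0.0 if rng.random() < p else h / (1.0 - p)


rng = random.Random(0)           # fixed seed so the run is repeatable
h, p, n = 2.0, 0.8, 200_000      # illustrative activation, drop probability, sample count

samples = [inverted_dropout(h, p, rng) for _ in range(n)]
mean = sum(samples) / n          # empirical estimate of E(h')

print(f"h={h}, empirical mean of h'={mean:.3f}")
```

With enough samples the empirical mean should land close to h, even though every individual sample is either 0 or h/(1-p).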

import torch

# define the dropout function
def dropout_test(x, dropout):
	"""
	Apply inverted dropout to a tensor.

	:param x: input tensor
	:param dropout: probability of dropping each element
	:return: a tensor with elements masked by dropout and the rest rescaled
	"""
	# dropout must be between 0 and 1
	assert 0 <= dropout <= 1
	# if dropout is 0, nothing is dropped: return x unchanged
	if dropout == 0:
		return x
	# if dropout is 1, everything is dropped: return zeros shaped like x
	if dropout == 1:
		return torch.zeros_like(x)
	# torch.rand returns a tensor filled with values from the uniform distribution on [0, 1)
	# an element is kept (mask 1) when its draw is greater than dropout,
	# which happens with probability 1 - dropout; otherwise the mask is 0
	mask = (torch.rand(x.shape, device=x.device) > dropout).float()
	# scale the kept elements by 1 / (1 - dropout) so the expectation matches the input
	return mask * x / (1.0 - dropout)


# use x rather than input, which would shadow the Python builtin
x = torch.rand(3, 4)
dropout = 0.8
output = dropout_test(x, dropout)

print(f"input={x}")
print(f"output={output}")

2. nn.Dropout

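PyTorch provides dropout in two forms: the function torch.nn.functional.dropout and the module class torch.nn.Dropout.
During training, nn.Dropout randomly zeroes some elements of the input tensor with probability p, using samples from a Bernoulli distribution, and scales the remaining elements by 1/(1-p). The elements to zero are sampled independently on every forward call. In evaluation mode the module returns its input unchanged.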

import torch
from torch import nn
from torch.nn import functional as F


my_dropout = nn.Dropout(p=0.6)
x_input = torch.randn(2,3,4)
x_output = my_dropout(x_input)
print(f"x_input={x_input}")
print(f"my_dropout={my_dropout}")
print(f"x_output={x_output}")

x_input=tensor([[[-1.1249,  0.4000,  1.3708,  0.7556],
         [-0.3823,  0.4001,  0.0950,  0.8916],
         [-1.2449, -0.8080,  0.2976, -2.3220]],

        [[ 0.6720,  1.9750, -0.5260, -1.6763],
         [ 1.2277, -0.0918, -1.4739,  0.3409],
         [ 0.2559, -0.8436, -0.5755, -0.4961]]])
my_dropout=Dropout(p=0.6, inplace=False)
x_output=tensor([[[-0.0000,  0.0000,  0.0000,  1.8891],
         [-0.0000,  0.0000,  0.2375,  2.2291],
         [-0.0000, -0.0000,  0.0000, -0.0000]],

        [[ 0.0000,  4.9375, -1.3149, -0.0000],
         [ 0.0000, -0.0000, -0.0000,  0.8523],
         [ 0.0000, -2.1089, -0.0000, -0.0000]]])
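
The difference between the module and the function shows up mostly in how training mode is controlled. The sketch below, which assumes an all-ones input so the scaling is easy to read, compares `nn.Dropout` under `train()` and `eval()` with `F.dropout` and its `training` argument; the variable names are illustrative.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
x = torch.ones(1000)                 # all-ones input makes the scaling visible
layer = nn.Dropout(p=0.6)

layer.train()                        # training mode: drop and rescale
y_train = layer(x)                   # each entry is 0 or 1 / (1 - 0.6) = 2.5

layer.eval()                         # evaluation mode: dropout is a no-op
y_eval = layer(x)

# The functional form takes the mode as an explicit argument instead.
y_f_train = F.dropout(x, p=0.6, training=True)
y_f_eval = F.dropout(x, p=0.6, training=False)

print(f"values seen in training mode: {torch.unique(y_train)}")
print(f"eval output equals input: {torch.equal(y_eval, x)}")
print(f"F.dropout(training=False) equals input: {torch.equal(y_f_eval, x)}")
```

Note that `F.dropout` defaults to `training=True`, so inside a custom module it is usually called with `training=self.training`; `nn.Dropout` handles this automatically through `train()` and `eval()`.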

3. NumPy implementation of dropout

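3.1 First implementation:

In training mode, only drop some of the activations and do not multiply by the factor 1/(1-rate). In test mode, the values must then be multiplied by (1-rate) so that the expected output matches training.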

train_mode:
$h'=\left\{ \begin{aligned} 0&&&\text{with probability } p\\ {h}& &&\text{with probability } 1-p \\ \end{aligned} \right.$
$E(h')=0\times p + h\times(1-p)=h(1-p)$
test_mode:
Dropout is not applied at test time, so to keep the test-mode output consistent with the training-mode expectation, all values are multiplied by (1-p):
$h'=h(1-p)$

import numpy as np

def train(rate, x, w1, b1, w2, b2):
	"""
	description:
	training-mode forward pass with plain (unscaled) dropout;
	the outputs are not scaled by 1/(1-rate) here, so test() must
	multiply by (1.0-rate) to keep the same expectation
	:param rate: probability of dropping each activation
	:param x: input tensor
	:param w1: weight_1 of layer1
	:param b1: bias_1 of layer1
	:param w2: weight_2 of layer2
	:param b2: bias_2 of layer2
	:return: output of layer2 after ReLU and dropout
	"""
	layer1 = np.maximum(0, (np.dot(x, w1) + b1))
	mask1 = np.random.binomial(1, 1.0 - rate, layer1.shape)
	layer1 = layer1 * mask1

	layer2 = np.maximum(0, (np.dot(layer1, w2) + b2))
	mask2 = np.random.binomial(1, 1.0 - rate, layer2.shape)
	layer2 = layer2 * mask2
	return layer2


def test(rate, x, w1, b1, w2, b2):
	layer1 = np.maximum(0, np.dot(x, w1) + b1)
	layer1 = layer1 * (1.0 - rate)

	layer2 = np.maximum(0, np.dot(layer1, w2) + b2)
	layer2 = layer2 * (1.0 - rate)

	return layer2

3.2 Second implementation:

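In training mode, drop some of the activations and multiply the rest by the factor 1/(1-rate). This keeps the expectation unchanged, so test mode needs no extra processing and the training and test expectations agree.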

train_mode:
$h'=\left\{ \begin{aligned} 0&&&\text{with probability } p\\ \frac{h}{1-p}& &&\text{with probability } 1-p \\ \end{aligned} \right.$
$E(h')=0\times p + \frac{h}{1-p}\times(1-p)=h$
test_mode:

Dropout is not applied at test time. Because the training-mode expectation already equals h, the test-mode values can be left unchanged:
$h'=h$

import numpy as np

def another_train(rate, x, w1, b1, w2, b2):
	layer1 = np.maximum(0, (np.dot(x, w1) + b1))
	mask1 = np.random.binomial(1, 1.0 - rate, layer1.shape)
	layer1 = layer1 * mask1
	layer1 = layer1 / (1.0 - rate)
	
	layer2 = np.maximum(0, (np.dot(layer1, w2) + b2))
	mask2 = np.random.binomial(1, 1.0 - rate, layer2.shape)
	layer2 = layer2 * mask2
	layer2 = layer2 / (1.0 - rate)

	return layer2


def another_test(x, w1, b1, w2, b2):
	layer1 = np.maximum(0, np.dot(x, w1) + b1)
	layer2 = np.maximum(0, np.dot(layer1, w2) + b2)
	return layer2
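
To see that both schemes keep training and test consistent, the two can be compared on a single layer of fixed activations. This is a rough sketch rather than a test of the two-layer functions above (the ReLU between layers means the expectation is only preserved exactly per layer); the activation values, `rate`, and sample count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded generator for repeatability
rate = 0.5                           # drop probability
h = np.array([1.0, 2.0, 3.0])        # hypothetical activations of one layer
n = 100_000                          # number of simulated training passes

# One Bernoulli keep-mask per simulated pass: 1 = keep, 0 = drop.
mask = rng.binomial(1, 1.0 - rate, size=(n, h.size))

# Scheme 1: no scaling in training, scale by (1 - rate) at test time.
train_plain = (mask * h).mean(axis=0)
test_plain = h * (1.0 - rate)

# Scheme 2: scale by 1 / (1 - rate) in training, leave test values unchanged.
train_inverted = (mask * h / (1.0 - rate)).mean(axis=0)
test_inverted = h

print(f"scheme 1: mean train={train_plain}, test={test_plain}")
print(f"scheme 2: mean train={train_inverted}, test={test_inverted}")
```

In both schemes the average training output should approximately match the test output; the second scheme simply moves the scaling into training so that inference code does not need to know the dropout rate.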