Problem Description
While training a VGG19 classifier, I ran into a problem: the loss stays nan and the accuracy stays frozen at one fixed value, as shown in the output below. Even adding automatic learning-rate scheduling (ReduceLROnPlateau) did not fix it.
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# Cut the learning rate (by the default factor of 0.1) when val_loss
# plateaus for 10 epochs; stop when val_accuracy stalls for 30 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=10, mode='auto')
earlystopping = EarlyStopping(monitor='val_accuracy', verbose=1, patience=30)
history = model_vgg19.fit(train_gen,
                          validation_data=valid_gen,
                          epochs=200,
                          steps_per_epoch=len(train_gen),
                          validation_steps=len(valid_gen),
                          callbacks=[reduce_lr, earlystopping])
Output:
Epoch 1/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 2/200
176/176 [==============================] - 31s 176ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 3/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 4/200
176/176 [==============================] - 31s 176ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 5/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 6/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 7/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 8/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 9/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 10/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 11/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 12/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 13/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 14/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 15/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 16/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 17/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 18/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 19/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 20/200
176/176 [==============================] - 30s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 21/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 22/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 23/200
176/176 [==============================] - 31s 175ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 24/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 25/200
176/176 [==============================] - 31s 178ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 26/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 27/200
176/176 [==============================] - 31s 177ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 28/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 29/200
176/176 [==============================] - 31s 173ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 30/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 31/200
176/176 [==============================] - 31s 174ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-06
Epoch 00031: early stopping
Solution: lowering the learning rate fixed the problem.
The most common cause is a learning rate that is too high. In a classification problem, an overly high learning rate can make the model "stubbornly" assign some samples to the wrong class, with the predicted probability of the correct class being 0 (actually floating-point underflow); cross entropy then evaluates to an infinite loss. Once that happens, differentiating the infinity with respect to the parameters yields NaN, and from then on every parameter in the network is NaN. The fix is to lower the learning rate, even all the way to 0, and see whether the problem persists. If it disappears, the learning rate really was the culprit; if it persists, the freshly initialized network was already broken, which most likely means a bug in the implementation.
Author: 王赟 Maigo. Link: https://www.zhihu.com/question/62441748/answer/232522878. Source: Zhihu. Copyright belongs to the author; for commercial reuse contact the author for authorization, for non-commercial reuse credit the source.
At the start of training the whole network is randomly initialized, so NaN appears easily. Lower the learning rate, trying 0.1, 0.01, 0.001 in turn until the NaN disappears; if it never does, the problem is probably in the network implementation. Author: 峻许. Link: https://www.zhihu.com/question/62441748/answer/232704244. Source: Zhihu. Copyright belongs to the author; for commercial reuse contact the author for authorization, for non-commercial reuse credit the source.
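A minimal sketch of that fix, assuming the model is compiled with Adam and categorical cross entropy (the original compile call is not shown in this post, so the optimizer choice and the value 1e-4 are assumptions):

from tensorflow.keras.optimizers import Adam

# Recompile with a smaller initial learning rate; 1e-4 is one of the
# values suggested above (the Keras default for Adam is 1e-3).
model_vgg19.compile(optimizer=Adam(learning_rate=1e-4),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])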
Other possible causes
While searching for the root cause I came across other possible culprits, recorded here:
1
Possible causes:
A NaN loss can come from several kinds of problems:
- The learning rate is too high.
- For classification problems, use categorical cross entropy.
- For regression problems, a division by zero may have occurred; adding a small epsilon term may fix it.
- The data itself may contain NaN; check both input and target with numpy.any(numpy.isnan(x)) (see the sketch after this list).
- The target must be something the loss function can handle; for example, with a sigmoid activation the target should be greater than 0. Check the dataset accordingly.
Author: 猪了个去. Link: https://www.zhihu.com/question/62441748/answer/232520044. Source: Zhihu. Copyright belongs to the author; for commercial reuse contact the author for authorization, for non-commercial reuse credit the source.
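A minimal sketch of that NaN check, assuming train_gen yields (inputs, targets) batches the way a Keras ImageDataGenerator flow does:

import numpy as np

# Pull one batch from the generator and scan both images and labels
# for NaN/Inf before blaming the optimizer.
x_batch, y_batch = next(iter(train_gen))
print('NaN in inputs: ', np.any(np.isnan(x_batch)))
print('Inf in inputs: ', np.any(np.isinf(x_batch)))
print('NaN in targets:', np.any(np.isnan(y_batch)))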
2
Possible cause:
NaN (not a number) and Inf (infinity) typically appear when you divide by 0 or take log(0), so ask yourself whether, at the point where the loss is computed, your network output was 0, causing a log(0) that produced the nan. Copyright notice: this is an original article by the CSDN blogger 「accumulate_zhang」, licensed under CC 4.0 BY-SA; reproduction must include the original source link and this notice. Original link: https://blog.csdn.net/accumulate_zhang/article/details/79890624
Fixes: see link 1, link 2, and link 3.
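Beyond those links, one common guard against log(0) is to clip the predictions before the log. A minimal sketch of a hand-written safe loss (Keras's built-in categorical cross entropy already applies a similar epsilon clip internally, so this mainly matters for custom losses):

import tensorflow as tf

# Hypothetical custom loss: clip predictions away from 0 and 1 so the
# log can never see an exact zero.
def safe_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1)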
3
Possible cause:
Source: https://discuss.gluon.ai/t/topic/8925/4. Another possibility is that random augmentation misbehaves on your images and occasionally produces extreme augmentations. To verify, turn off some of the random crop and random expand in the default training augmentation.
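To test this hypothesis with Keras generators, a sketch of an augmentation-free ImageDataGenerator (the directory path, target size, and batch size here are placeholders, not the original settings):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Keep only rescaling; every random transform is switched off, so extreme
# augmentations can be ruled out as the cause.
plain_datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = plain_datagen.flow_from_directory('data/train',  # hypothetical path
                                              target_size=(224, 224),
                                              batch_size=32,
                                              class_mode='categorical')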
However, even after I removed the image augmentation from preprocessing (data augmentation via ImageDataGenerator), the loss still turned into nan partway through training. Output:
Epoch 1/200
176/176 [==============================] - 27s 84ms/step - loss: 10.1720 - accuracy: 0.0182 - val_loss: 7.0242 - val_accuracy: 0.0252 - lr: 0.0010
Epoch 2/200
176/176 [==============================] - 13s 74ms/step - loss: 10.0744 - accuracy: 0.0221 - val_loss: 5.1386 - val_accuracy: 0.0168 - lr: 0.0010
Epoch 3/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0149 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 4/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 5/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 6/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 7/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 8/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 9/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 10/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 11/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 12/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 0.0010
Epoch 13/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 14/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 15/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 16/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 17/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 18/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 19/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 20/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 21/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 22/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-04
Epoch 23/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 24/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 25/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 26/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 27/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 28/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 29/200
176/176 [==============================] - 13s 74ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 30/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 31/200
176/176 [==============================] - 13s 75ms/step - loss: nan - accuracy: 0.0100 - val_loss: nan - val_accuracy: 0.0100 - lr: 1.0000e-05
Epoch 00031: early stopping
So the problem comes back to adjusting the learning rate, as at the start.
4
Possible cause:
Gradient explosion. Fixes: lower the learning rate, clip gradients, or normalize. Copyright notice: this is an original article by the CSDN blogger 「鹅似一颗筱筱滴石头~」, licensed under CC 4.0 BY-SA; reproduction must include the original source link and this notice. Original link: https://blog.csdn.net/Drifter_Galaxy/article/details/104004267
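Gradient clipping is a one-line change on the Keras optimizer. A sketch, assuming Adam (clipnorm caps the L2 norm of each gradient; clipvalue would cap individual components instead):

from tensorflow.keras.optimizers import Adam

# Rescale any gradient whose L2 norm exceeds 1.0, so a single exploding
# batch cannot blow the weights up to NaN.
model_vgg19.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])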
On how to choose the learning rate:
(Point 1 of this answer is the advice already quoted above: try 0.1, 0.01, 0.001 until the NaN disappears.)
2. The learning rate generally varies inversely with network depth: the more layers, the smaller the learning rate usually has to be.
3. Sometimes you can first train with a small learning rate for 5000 or more iterations, keep the resulting parameters, manually kill the run, and then fine-tune from those parameters; at that point you can raise the learning rate and converge faster (see the sketch below).
4. If you train with Caffe and see no output while GPU utilization sits at or near 0: friend, it has already gone nan. Your display interval is just set too large to show it; set it to 1 and you will see.
Author: 峻许. Link: https://www.zhihu.com/question/62441748/answer/232704244. Source: Zhihu. Copyright belongs to the author; for commercial reuse contact the author for authorization, for non-commercial reuse credit the source.
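A rough sketch of the warm-up-then-fine-tune idea from point 3, translated into Keras terms; the SGD optimizer, both learning rates, the epoch counts, and the checkpoint filename are all placeholders, not settings from the original run:

from tensorflow.keras.optimizers import SGD

# Stage 1: warm up with a small learning rate until the loss is stable.
model_vgg19.compile(optimizer=SGD(learning_rate=1e-4),
                    loss='categorical_crossentropy', metrics=['accuracy'])
model_vgg19.fit(train_gen, epochs=5, validation_data=valid_gen)
model_vgg19.save_weights('warmup.weights.h5')  # hypothetical filename

# Stage 2: resume from the warm-up weights with a larger learning rate.
model_vgg19.load_weights('warmup.weights.h5')
model_vgg19.compile(optimizer=SGD(learning_rate=1e-3),
                    loss='categorical_crossentropy', metrics=['accuracy'])
model_vgg19.fit(train_gen, epochs=50, validation_data=valid_gen)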