Traceback (most recent call last):
  File "src/main.py", line 442, in <module>
    main(args)
  File "src/main.py", line 404, in main
    args.clip_max_norm, args)
  File "/home/wsx/0A_DATA/HFPN/src/engine.py", line 52, in train_one_epoch
    losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
UnboundLocalError: local variable 'loss_dict' referenced before assignment
Killing subprocess 21108
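Cause: several distributed training jobs were running at the same time and exhausted GPU memory. Presumably the out-of-memory failure kept the statement that assigns loss_dict from completing, so line 52 of engine.py read a variable that had never been set.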
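To illustrate how an out-of-memory failure can surface as an UnboundLocalError, here is a minimal sketch. It does not reproduce the real engine.py, whose contents are not shown here; the function name train_step, the simulate_oom flag, and the swallowed exception are all assumptions made for the demonstration.

```python
def train_step(simulate_oom: bool) -> float:
    """Hypothetical training step: the forward pass may fail before loss_dict is set."""
    weight_dict = {"loss_ce": 1.0}
    try:
        if simulate_oom:
            # Stand-in for a forward pass that runs out of GPU memory.
            raise RuntimeError("CUDA out of memory")
        loss_dict = {"loss_ce": 0.5}
    except RuntimeError:
        # Assumed behavior: the error is swallowed and execution continues.
        pass
    # If the assignment above was skipped, this line raises UnboundLocalError.
    return sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)


if __name__ == "__main__":
    try:
        train_step(simulate_oom=True)
    except UnboundLocalError as err:
        print("caught:", type(err).__name__)
```

In this sketch the real problem (memory) is hidden behind a misleading error about a variable, which is why checking GPU usage is worth doing before debugging the loss code itself.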
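Fix: reduce the batch size, change the DDP setup, lower GPU memory usage, and clean up how the data is passed in.
In addition, you can wrap the failing statement in the following block to get an early warning: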
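import sys

try:
    ...  # the statements that raise the error go here
    ...
except RuntimeError as e:
    if "out of memory" in str(e):
        sys.exit('Out Of Memory')
    else:
        raise  # re-raise any other RuntimeError with its original traceback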
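The same check can be packaged as a small helper so it does not have to be repeated around every call. This is only a sketch: the names is_oom_error and guarded_step are hypothetical, and the string match on "out of memory" is the same heuristic as above rather than an official API.

```python
import sys


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic: treat a RuntimeError whose message mentions 'out of memory' as OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def guarded_step(step_fn, *args, **kwargs):
    """Run step_fn; exit with a clear message on OOM, re-raise any other error."""
    try:
        return step_fn(*args, **kwargs)
    except RuntimeError as e:
        if is_oom_error(e):
            sys.exit("Out Of Memory")
        raise
```

A call such as guarded_step(train_one_epoch, ...) would then stop the process with the message "Out Of Memory" instead of continuing into a confusing follow-up error.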