When I set "find_unused_parameters=True", the model trains but does not converge; it seems to learn nothing at all, so I suspect something is wrong with the gradients.
When I set "find_unused_parameters=False", it raises the error below, which says the decoder parameters received no gradient. What could cause the decoder to receive no gradients? Any suggestions?
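As a debugging aid, one way to find which parameters went unused in a training step is to check which ones still have `grad is None` after `backward()`. The sketch below uses a hypothetical `TwoBranchModel` (not the actual model from this issue) whose decoder is defined but never called in `forward`, which reproduces the same situation DDP is complaining about:

```python
import torch
import torch.nn as nn

# Hypothetical model reproducing the "unused parameter" situation:
# the decoder branch is defined but never used in forward(), so its
# parameters never participate in the loss and receive no gradient.
class TwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)  # defined but unused below

    def forward(self, x):
        return self.encoder(x)  # decoder output never reaches the loss

model = TwoBranchModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# After backward(), any parameter whose .grad is still None was unused
# this step -- these are exactly the names DDP would report.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # decoder.weight and decoder.bias appear here
```

Running a check like this on a single GPU (before wrapping in DistributedDataParallel) can confirm whether the decoder is genuinely disconnected from the loss, e.g. because its output is discarded, the loss only uses the encoder output, or a detach/no_grad block cuts the graph.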
In my model training, if "find_unused_parameters=False" is set, it raises the following error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: decoder.0.norm1.weight, decoder.0.norm1.bias, decoder.0.self_attn.qkv.weight, decoder.0.self_attn.proj.weight, decoder.0.self_attn.proj.bias, decoder.0.norm_q.weight, decoder.0.norm_q.bias, decoder.0.norm_v.weight, decoder.0.norm_v.bias, decoder.0.cross_attn.q_map.weight, decoder.0.cross_attn.k_map.weight, decoder.0.cross_attn.v_map.weight, decoder.0.cross_attn.proj.weight, decoder.0.cross_attn.proj.bias, decoder.0.norm2.weight, decoder.0.norm2.bias, decoder.0.mlp.fc1.weight, decoder.0.mlp.fc1.bias, decoder.0.mlp.fc2.weight, decoder.0.mlp.fc2.bias, decoder.1.norm1.weight, decoder.1.norm1.bias, decoder.1.self_attn.qkv.weight, decoder.1.self_attn.proj.weight, decoder.1.self_attn.proj.bias, decoder.1.norm_q.weight, decoder.1.norm_q.bias, decoder.1.norm_v.weight, decoder.1.norm_v.bias, decoder.1.cross_attn.q_map.weight, decoder.1.cross_attn.k_map.weight, decoder.1.cross_attn.v_map.weight, decoder.1.cross_attn.proj.weight, decoder.1.cross_attn.proj.bias, decoder.1.norm2.weight, decoder.1.norm2.bias, decoder.1.mlp.fc1.weight, decoder.1.mlp.fc1.bias, decoder.1.mlp.fc2.weight, decoder.1.mlp.fc2.bias
Parameter indices which did not receive grad for rank 1: 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 463885) of binary: /home/mnt/xyqian/miniconda3/envs/detector_21806_2/bin/python
/home/mnt/xyqian/miniconda3/envs/detector_21806_2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py:367: UserWarning: