When I set "find_unused_parameters=True", the model trains but does not converge; it seems to learn nothing at all, so I suspect something is wrong with the gradients.
When I set "find_unused_parameters=False", it raises the error below, which says the decoder parameters received no gradient. What could cause the decoder to receive no gradients? Any suggestions?
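As a debugging aid, one way to find which parameters went unused in a training step is to check which ones still have `grad is None` after `backward()`. The sketch below uses a hypothetical `TwoBranchModel` (not the actual model from this issue) whose decoder is defined but never called in `forward`, which reproduces the same situation DDP is complaining about:

```python
import torch
import torch.nn as nn

# Hypothetical model reproducing the "unused parameter" situation:
# the decoder branch is defined but never used in forward(), so its
# parameters never participate in the loss and receive no gradient.
class TwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)  # defined but unused below

    def forward(self, x):
        return self.encoder(x)  # decoder output never reaches the loss

model = TwoBranchModel()
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# After backward(), any parameter whose .grad is still None was unused
# this step -- these are exactly the names DDP would report.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # decoder.weight and decoder.bias appear here
```

Running a check like this on a single GPU (before wrapping in DistributedDataParallel) can confirm whether the decoder is genuinely disconnected from the loss, e.g. because its output is discarded, the loss only uses the encoder output, or a detach/no_grad block cuts the graph.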
In my model training, if "find_unused_parameters=False" is set, it raises the following error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: decoder.0.norm1.weight, decoder.0.norm1.bias, decoder.0.self_attn.qkv.weight, decoder.0.self_attn.proj.weight, decoder.0.self_attn.proj.bias, decoder.0.norm_q.weight, decoder.0.norm_q.bias, decoder.0.norm_v.weight, decoder.0.norm_v.bias, decoder.0.cross_attn.q_map.weight, decoder.0.cross_attn.k_map.weight, decoder.0.cross_attn.v_map.weight, decoder.0.cross_attn.proj.weight, decoder.0.cross_attn.proj.bias, decoder.0.norm2.weight, decoder.0.norm2.bias, decoder.0.mlp.fc1.weight, decoder.0.mlp.fc1.bias, decoder.0.mlp.fc2.weight, decoder.0.mlp.fc2.bias, decoder.1.norm1.weight, decoder.1.norm1.bias, decoder.1.self_attn.qkv.weight, decoder.1.self_attn.proj.weight, decoder.1.self_attn.proj.bias, decoder.1.norm_q.weight, decoder.1.norm_q.bias, decoder.1.norm_v.weight, decoder.1.norm_v.bias, decoder.1.cross_attn.q_map.weight, decoder.1.cross_attn.k_map.weight, decoder.1.cross_attn.v_map.weight, decoder.1.cross_attn.proj.weight, decoder.1.cross_attn.proj.bias, decoder.1.norm2.weight, decoder.1.norm2.bias, decoder.1.mlp.fc1.weight, decoder.1.mlp.fc1.bias, decoder.1.mlp.fc2.weight, decoder.1.mlp.fc2.bias
Parameter indices which did not receive grad for rank 1: 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 463885) of binary: /home/mnt/xyqian/miniconda3/envs/detector_21806_2/bin/python
/home/mnt/xyqian/miniconda3/envs/detector_21806_2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py:367: UserWarning: