deepspeed 训练多机多卡报错 ncclSystemError Last error - 服务器托管|北京服务器租用|机房托管租用|IDC托管租用|机房机柜带宽租用-价格及费用咨询

最近在搞分布式训练大模型，踩了两个晚上的坑今天终于爬出来了

我们使用 2台 8*H100

遇到过

错误1

10.255.19.85: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
10.255.19.85: Last error:
10.255.19.85: socketStartConnect: Connect to 127.0.0.1 failed : Software caused connection abort

错误2

10.255.19.82: torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key ‘0’, but store->get(‘0’) got error: Connection reset by

错误3

10.255.19.85: ncclInternalError: Internal check failed.
10.255.19.85: Last error:
10.255.19.85: Bootstrap : no socket interface found

其实这三个错误都是一个问题导致的，就是网卡配置的问题

这是之前的配置

hostfile

10.255.19.82 slots=8
10.255.19.85 slots=8

fine-tune.sh

hostfile="/data2/xinyuuliu/Baichuan2-main/fine-tune/hostfile"

export NCCL_SOCKET_IFNAME=enp194s0f0
# export MASTER_ADDR=10.255.19.82   # 主节点的IP地址
# export MASTER_PORT=29500
# --include localhost:0,1,2,3,4,5,6,7
export NCCL_DEBUG=INFO
# export NCCL_IB_TIMEOUT=22
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_TC=128
# export NCCL_IB_DISABLE=1

deepspeed --master_addr 10.255.19.82 --master_port 29500 --hostfile=$hostfile fine-tune.py  
    --report_to "none" 
    --data_path "/data2/xinyuuliu/Baichuan2-main/fine-tune/data/全网评价总结训练数据.json" 
    --model_name_or_path "/data1/xinyuuliu/Baichuan2-13B-Chat" 
    --output_dir "output_lora_summary" 
    --model_max_length  10000
    --num_train_epochs 10 
    --per_device_train_batch_size 4 
    --gradient_accumulation_steps 1 
    --save_strategy epoch 
    --learning_rate 2e-4 
    --lr_scheduler_type constant 
    --adam_beta1 0.9 
    --adam_beta2 0.98 
    --adam_epsilon 1e-8 
    --max_grad_norm 1.0 
    --weight_decay 1e-4 
    --warmup_ratio 0.0 
服务器托管    --logging_steps 1 
    --gradient_checkpointing True 
    --deepspeed ds_config.json 
    --bf16 True 
    --tf32 True 
    --use_lora True 
    # --load_lora_path /data2/xinyuuliu/Baichuan2-main/fine-tune/output_lora4_1_1/checkpoint-7516
    # --use_NEFT True
    # --use_frozen True

 #"/data2/xinyuuliu/baichuan2_7B"

修改后之后的配置

问题主要出现在NCCL_SOCKET_IFNAME 这个环境变量，这个环境变量会被携带到其他机器上，但是网卡的名称是不一样的，尤其我的两台机器包含很多的网卡，因此，我们配置的时候需要把那些没用的虚拟网卡屏蔽掉，只留下需要的

解决方法，这个要根据实际情况来改，英伟达官方写法如下，屏蔽显卡用^开头

export NCCL_SOCKET_IFNAME=^br-c485a8390817,docker0,eno2,ens6f0,ens6f1,enx46a838614d5f,lo,veth23d7383,br-dcd3e4ec14e7,enp194s0f1,ens6f0,ens6f1,enxe278666d5a52,veth110d0b7,veth215ea4e,veth3203d6b,veth87c3cbf,vethec6fc79,virbr0

我的网卡太多了，之前一直以为环境变量在 /etc/profile 下配置各自的就行，

然后再source/etc/profile，结果发现不对，从机也都指向了主机的网卡，没办法建立socket

hostfile="/data2/xinyuuliu/Baichuan2-main/fine-tune/hostfile"

export NCCL_SOCKET_IFNAME=^br-c485a8390817,docker0,eno2,ens6f0,ens6f1,enx46a838614d5f,lo,veth23d7383,br-dcd3e4ec14e7,enp194s0f1,ens6f0,ens6f1,enxe278666d5a52,veth110d0b7,veth215ea4e,veth3203d6b,veth87c3cbf,vethec6fc79,virbr0
# export NCCL_SOCKET_IFNAME=enp194s0f0
# export MASTER_ADDR=10.255.19.82   # 主节点的IP地址
# export MASTER_PORT=29500
# --include localhost:0,1,2,3,4,5,6,7
export NCCL_DEBUG=INFO
# export NCCL_IB_TIMEOUT=22
# export NCCL_IB_GID_INDEX=3
# export NCCL_IB_TC=128
# export NCCL_IB_DISABLE=1

deepspeed --master_addr 10.255.19.82 --master_port 29500 --hostfile=$hostfile fine-tune.py  
    --report_to "none" 
    --data_path "/data2/xinyuuliu/Baichuan2-main/fine-tune/d服务器托管ata/全网评价总结训练数据.json" 
    --model_name_or_path "/data1/xinyuuliu/Baichuan2-13B-Chat" 
    --output_dir "output_lora_summary" 
    --model_max_length  10000
    --num_train_epochs 10 
    --per_device_train_batch_size 4 
    --gradient_accumulation_steps 1 
    --save_strategy epoch 
    --learning_rate 2e-4 
    --lr_scheduler_type constant 
    --adam_beta1 0.9 
    --adam_beta2 0.98 
    --adam_epsilon 1e-8 
    --max_grad_norm 1.0 
    --weight_decay 1e-4 
    --warmup_ratio 0.0 
    --logging_steps 1 
    --gradient_checkpointing True 
    --deepspeed ds_config.json 
    --bf16 True 
    --tf32 True 
    --use_lora True 
    # --load_lora_path /data2/xinyuuliu/Baichuan2-main/fine-tune/output_lora4_1_1/checkpoint-7516
    # --use_NEFT True
    # --use_frozen True

 #"/data2/xinyuuliu/baichuan2_7B"

修改后成功跑起16*H100，爽歪歪

服务器托管，北京服务器托管，服务器租用 http://www.fwqtg.net
机房租用，北京机房租用，IDC机房托管， http://www.fwqtg.net

相关推荐: 5.6 Linux rsync 服务

1、rsync 概念介绍官方网站：rsync rsync(Remote Sync) 是一个Unix/linux系统下的文件同步和传输工具。Rsync通过“rsync算法”提供了一个客户机和远程服务器的文件同步的快速方法。采用C/S模式端口tcp:873 …

服务器托管，北京服务器托管，服务器租用，机房机柜带宽租用