Building PaddlePaddle from Source with Docker and an NVIDIA GPU

1. Preparing the Build Environment

1.1 Docker Proxy

I ran into quite a few problems at this step, so it gets a longer treatment.

(Figure: server hardware overview — /paddle_build_from_docker_nv_gpu/image/index/hardware.png)

My environment:

  • A multi-user Ubuntu Server, where most users have no root privileges (only non-root users are considered here);
  • All users share the Docker base components, with a separate Docker configuration file in each user's home directory;
  • All users are in the docker group, can use the docker command, and Docker runs in rootless mode.

Why does this topic get its own section?

(Figure: /paddle_build_from_docker_nv_gpu/image/index/docker_call.webp)

Under these constraints, I found that Docker's rootless mode has several problems:

  • Containers run by the rootless daemon cannot see the host's network environment (even when started with the --network=host option).
  • A network proxy configured for the Docker daemon (which in practice means configuring a proxy through systemd) does not take effect for non-root users (see rootless1.log below).
  • In rootless mode, containers created by ordinary users lack many key Docker features (see the warnings in rootless1.log below).

Since rootless mode brings so many limitations and inconveniences, we might as well disable it (as for the security concern: rewind five years and Docker did not support rootless at all).
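
A quick way to see the proxy problem for yourself. The root daemon reads its proxy from a systemd drop-in; the rootless daemon, running as a user service, never sees it (compare rootless1.log and rootless2.log below):

# Proxy for the root daemon lives in a systemd drop-in, e.g.
# /etc/systemd/system/docker.service.d/http-proxy.conf:
#   [Service]
#   Environment="HTTP_PROXY=http://127.0.0.1:7890" "HTTPS_PROXY=http://127.0.0.1:7890"

sudo docker info | grep -i proxy   # root daemon: proxy entries appear
docker info | grep -i proxy        # rootless daemon: nothing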

Docker properties in rootless mode (rootless1.log):

$ docker info

Client: Docker Engine - Community
 Version:    27.3.1
 Context:    rootless
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  rootless
  cgroupns
 Kernel Version: 6.8.0-49-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: XXXXXXX
 ID: 7cee499f-2387-48c2-8a51-ebe4754ac250
 Docker Root Dir: /home/shikai/.local/share/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

When multiple users share a server, capping a single user's hardware resource usage is often necessary, and Docker's rootless mode does not support this (note the long list of cpu/io/cpuset warnings above).
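
For reference, these are the limits those warnings disable. Against a root daemon they are enforced through cgroups (the image tag here is just an example):

# Cap a container at 8 CPUs and 32 GiB of RAM; under the rootless daemon
# above these flags cannot be enforced (hence "No cpu cfs quota support" etc.)
sudo docker run -it --cpus 8 --memory 32g ubuntu:22.04 /bin/bash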

Docker properties in root mode (rootless2.log):

$ sudo docker info

Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 2
  Paused: 0
  Stopped: 2
 Images: 7
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-49-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: XXXXXXX
 ID: 921954a5-0225-4152-bfc1-abd0d1893845
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http://127.0.0.1:7890
 HTTPS Proxy: http://127.0.0.1:7890
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://docker.mirrors.ustc.edu.cn/
  https://docker.nju.edu.cn/
  https://mirror.baidubce.com/
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

In root mode, Docker loads the proxy correctly and has the full feature set.

  • How do I uninstall Docker rootless?
sudo systemctl stop docker.service

# Install the Docker CE rootless extras (provides rootlesskit)
sudo apt install docker-ce-rootless-extras -y

# Remove the per-user data created in rootless mode; rootlesskit is needed
# because these files are owned by remapped sub-UIDs
/usr/bin/rootlesskit rm -rf /home/shikai/.local/share/docker

sudo systemctl start docker.service
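
Two follow-ups worth doing, sketched under the assumption that rootless Docker was set up with the standard setup tool: disable the per-user daemon itself, and point the CLI back at the root daemon (rootless1.log above shows "Context: rootless", so the client context needs switching too):

# Disable and unregister the per-user rootless daemon
dockerd-rootless-setuptool.sh uninstall

# The CLI may still target the rootless socket; switch back to the root daemon
docker context ls
docker context use default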

1.2 Installing the NVIDIA Container Toolkit

1. Configure the APT repository

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

2. Enable the experimental packages (optional)

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

3. Update the package index

sudo apt-get update

4. Install nvidia-container-toolkit

sudo apt-get install -y nvidia-container-toolkit
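
Per NVIDIA's install guide, the toolkit then has to be registered as a Docker runtime (this writes the nvidia runtime into /etc/docker/daemon.json, which is why rootless2.log above lists it):

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker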

2. Building PaddlePaddle

2.1 Clone the PaddlePaddle Source

git clone https://github.com/PaddlePaddle/Paddle.git

2.2 Pull the PaddlePaddle Docker Image

My local CUDA version is 12.3, cuDNN is 9.0, and TensorRT is 8.6. If your local versions don't match an available image tag, you can download a matching toolkit from the CUDA archive.

docker pull registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
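
To double-check that the tag matches the host before pulling a multi-gigabyte image, two standard checks (nvcc is only present if the CUDA toolkit itself is installed on the host):

nvidia-smi       # driver version and the highest CUDA version it supports
nvcc --version   # host CUDA toolkit version (12.3 in my case)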

2.3 Run the PaddlePaddle Docker Image

--gpus all makes all of the host's GPUs available inside the container; $PATH_TO_PADDLE_REPO below is the path to the Paddle checkout from step 2.1 (after the cd, simply $PWD).

# Enter the PaddlePaddle source directory
cd Paddle 

# Start a container from the image
docker run -it \
--gpus all \
--name paddle-albresky \
-v $PATH_TO_PADDLE_REPO:/paddle \
--network=host  \
registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 \
/bin/bash

Here Docker failed with "Error response from daemon: could not select device driver":

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Installing nvidia-docker2 resolved it:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
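
Before retrying the Paddle image, it's worth a quick smoke test that GPU passthrough works at all (the CUDA image tag is illustrative; any tag compatible with your driver will do):

sudo docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi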

2.4 Build PaddlePaddle

# Run the following inside the container:
cd /paddle
mkdir -p /paddle/build && cd /paddle/build
pip3.10 install protobuf
apt install -y patchelf   # patchelf modifies the dynamic linker and RPATH of ELF executables
git config --global --add safe.directory /paddle

# Generate the Makefile
cmake .. -DPY_VERSION=3.10 -DWITH_GPU=ON

# Compile
make -j$(nproc)
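
If you have no need for Paddle's unit tests, the official build docs also accept -DWITH_TESTING=OFF, which trims the build noticeably; a variant of the configure step:

# Same configuration, with the test suite disabled
cmake .. -DPY_VERSION=3.10 -DWITH_GPU=ON -DWITH_TESTING=OFF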

2.5 Install PaddlePaddle

cd /paddle/build/python/dist
# The actual wheel filename depends on the version and Python ABI built above
pip3.10 install -U the_built_wheel.whl
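
To confirm the install actually sees the GPUs, Paddle ships a built-in self-check:

python3.10 -c "import paddle; paddle.utils.run_check()"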

3. Problems Encountered

3.1 Out of Memory

At first I compiled with 128 jobs (the CPU has 128 cores). Partway through, make died with error code 137, i.e. the compiler processes were SIGKILLed (128 + 9) by the kernel's OOM killer because the machine ran out of memory:

[ 10%] Completed 'extern_onednn'
[ 10%] Built target extern_onednn
[ 63%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_causal_sm80.cu.o
Scanning dependencies of target onednn_cmd
[ 10%] Generating third_party/install/onednn/libdnnl.so.3
[ 10%] Built target onednn_cmd
Scanning dependencies of target onednn
[ 11%] Building C object CMakeFiles/onednn.dir/onednn_dummy.c.o
[ 11%] Linking C static library libonednn.a
[ 11%] Built target onednn
[ 63%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_densemask_sm80.cu.o
[ 64%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_flashmask_sm80.cu.o
[ 64%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_sm80.cu.o
[ 65%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_densemask_sm80.cu.o
[ 65%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_flashmask_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_densemask_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_flashmask_sm80.cu.o
[ 67%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_sm80.cu.o
[ 67%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_densemask_sm80.cu.o
[ 68%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_flashmask_sm80.cu.o
[ 68%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_sm80.cu.o
[ 69%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_densemask_sm80.cu.o
[ 69%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_flashmask_sm80.cu.o
[ 70%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_sm80.cu.o
[ 70%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_fp16_causal_densemask_sm80.cu.o
[ 71%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_fp16_causal_flashmask_sm80.cu.o
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:394: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_densemask_sm80.cu.o] Error 137
make[5]: *** Waiting for unfinished jobs....
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:407: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_flashmask_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:1343: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim64_fp16_causal_flashmask_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:459: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:420: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu.o] Error 137

Solution: lower the number of jobs. I dropped it to 96 and the build succeeded:

[ 77%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_causal_sm80.cu.o
[ 78%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_densemask_sm80.cu.o
[ 78%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_flashmask_sm80.cu.o
[ 79%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_sm80.cu.o
[ 79%] Linking CUDA shared library libflashattn.so
[100%] Built target flashattn
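
Rather than guessing, the job count can be derived from available memory; a rough sketch, assuming each flash-attention CUDA compile job peaks around 4 GiB (an estimate from watching this build, not a documented figure):

# Cap parallelism by available RAM instead of core count
MEM_JOBS=$(( $(free -g | awk '/^Mem:/{print $7}') / 4 ))
CPU_JOBS=$(nproc)
make -j$(( MEM_JOBS < CPU_JOBS ? MEM_JOBS : CPU_JOBS ))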

3.2 GPU passthrough fails under the rootless daemon

Separately, running the image with --gpus all against the rootless daemon failed with the cgroup error below — the same class of limitation discussed in section 1.1:
# docker run --gpus all -it registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6

Digest: sha256:996c18b0958a4e655ae5e9d3d8f35c91971bba7ab55024c85ff60d65242ecefa
Status: Downloaded newer image for registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
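
Running against the root daemon (as set up in section 1.1) avoids this. If you must stay rootless, NVIDIA's container toolkit docs describe a workaround: disable cgroup management in the runtime config:

# Rootless workaround from NVIDIA's docs: skip cgroup device rules
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
systemctl --user restart docker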
