1. Preparing the Build Environment
1.1 Docker Proxy
I ran into quite a few problems at this step, so it deserves some extra discussion.

My environment:
- a multi-user Ubuntu Server on which most users have no root privileges (only non-root users are discussed here);
- all users share the base Docker components, with an independent Docker configuration file under each user's home directory;
- all users are in the `docker` group and can run the `docker` command, and Docker runs in rootless mode (see the quick inspection sketch after this list).
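A quick way to inspect such a setup, assuming the standard rootless layout (where each user's daemon reads `~/.config/docker/daemon.json` rather than `/etc/docker/daemon.json`):

```bash
# Which daemon is this user's CLI talking to? Rootless installs add their own context.
docker context ls
# Per-user daemon configuration, if present
cat ~/.config/docker/daemon.json
```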
Why single this section out?

Under the constraints above, I found that Docker's rootless mode has the following problems:
- Containers running in rootless mode cannot see the host's network environment, even when started with the `--network=host` option.
- The network proxy configured for the Docker daemon (which in essence means configuring a proxy for the `systemd` unit; see the drop-in sketch after this list) does not take effect for non-root users (see `rootless1.log` below).
- In rootless mode, containers created by users lack many of Docker's key features (see `rootless1.log` below).
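For context, the daemon proxy mentioned above is the standard systemd drop-in for the rootful daemon; the address below is the local proxy that later shows up in `rootless2.log`:

```bash
# /etc/systemd/system/docker.service.d/http-proxy.conf tells systemd to pass
# the proxy into the docker daemon's environment.
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7890"
Environment="HTTPS_PROXY=http://127.0.0.1:7890"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```

Because the rootless daemon runs as a per-user systemd unit, this system-level drop-in never reaches it, which is exactly the non-root symptom listed above.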
Since rootless mode comes with so many restrictions and inconveniences, we might as well disable it (as for the security concern: go back five years and Docker did not support rootless mode at all).
Docker properties in rootless mode (`rootless1.log`):
```
$ docker info
Client: Docker Engine - Community
 Version:    27.3.1
 Context:    rootless
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  rootless
  cgroupns
 Kernel Version: 6.8.0-49-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: XXXXXXX
 ID: 7cee499f-2387-48c2-8a51-ebe4754ac250
 Docker Root Dir: /home/shikai/.local/share/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
```
When multiple users share a server, it is often necessary to cap an individual user's hardware resource usage, and Docker's rootless mode does not support this.
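For example, the per-container limits below rely on the cpu, cpuset, and memory cgroup controllers that the warnings at the end of `rootless1.log` report as unsupported (the image name is just an illustration):

```bash
# Cap a container at 4 CPUs, pin it to cores 0-3, and limit it to 8 GiB of RAM;
# none of this works when the cgroup controllers are unavailable.
docker run -it --cpus=4 --cpuset-cpus=0-3 --memory=8g ubuntu:22.04 /bin/bash
```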
Docker properties in root (rootful) mode (`rootless2.log`):
```
$ sudo docker info
Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 2
  Paused: 0
  Stopped: 2
 Images: 7
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-49-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: XXXXXXX
 ID: 921954a5-0225-4152-bfc1-abd0d1893845
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http://127.0.0.1:7890
 HTTPS Proxy: http://127.0.0.1:7890
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://docker.mirrors.ustc.edu.cn/
  https://docker.nju.edu.cn/
  https://mirror.baidubce.com/
 Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
```
Docker in root mode, by contrast, correctly picks up the proxy and offers the full feature set. Disabling rootless mode:
```bash
sudo systemctl stop docker.service
# Install the extra components for Docker-CE rootless mode
sudo apt install docker-ce-rootless-extras -y
# Remove the user data created under rootless mode
/usr/bin/rootlesskit rm -rf /home/shikai/.local/share/docker
sudo systemctl start docker.service
```
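For a cleaner removal, the `docker-ce-rootless-extras` package also ships an uninstall helper, and each user may need to point their CLI back at the rootful daemon (a sketch; run per user):

```bash
# Remove the per-user rootless daemon units
dockerd-rootless-setuptool.sh uninstall
# Stop using the rootless context that the rootless setup created
docker context use default
```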
Next, install the NVIDIA Container Toolkit:

1. Configure the APT repository
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
2. Enable the experimental packages
```bash
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
3. Update the package index
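That is, refresh the package index so the new repository is picked up:

```bash
sudo apt-get update
```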
4. Install nvidia-container-toolkit
```bash
sudo apt-get install -y nvidia-container-toolkit
```
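The toolkit then has to be registered with Docker; this standard step from NVIDIA's docs is what produces the `nvidia` runtime visible in `rootless2.log` above:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```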
2. Building PaddlePaddle
2.1 Clone the PaddlePaddle Source
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
```
2.2 Pull the PaddlePaddle Docker Image
My local CUDA version is 12.3, cuDNN is 9.0, and TensorRT is 8.6. If your versions do not match, you can download the corresponding ones from the CUDA archive.
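To see what your machine actually has before picking an image tag (standard queries; `nvcc` is only present if the CUDA toolkit is installed locally):

```bash
nvidia-smi       # driver version and the highest CUDA version it supports
nvcc --version   # version of the locally installed CUDA toolkit
```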
```bash
docker pull registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
```
2.3 Run the PaddlePaddle Docker Image
`--gpus all` makes all GPUs on the host available inside the Docker container; `-v $PATH_TO_PADDLE_REPO:/paddle` mounts the source checkout (set this variable to your clone path) at `/paddle` inside the container.
```bash
# Enter the PaddlePaddle source directory
cd Paddle
# Run the Docker image
docker run -it \
    --gpus all \
    --name paddle-albresky \
    -v $PATH_TO_PADDLE_REPO:/paddle \
    --network=host \
    registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 \
    /bin/bash
```
At this point Docker reported an error: `Error response from daemon: could not select device driver`
```
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
```
Installing `nvidia-docker2` fixed it:
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```
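(`nvidia-docker2` is the older packaging; newer setups usually only need the `nvidia-container-toolkit` route above.) A quick smoke test to confirm that GPUs are reachable from containers (the CUDA image tag is illustrative):

```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```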
2.4 Build PaddlePaddle
```bash
# Run the following inside the Docker container:
cd /paddle
mkdir -p /paddle/build && cd /paddle/build
pip3.10 install protobuf
apt install patchelf  # modifies the dynamic linker and RPATH of ELF executables
git config --global --add safe.directory /paddle
# Generate the Makefile
cmake .. -DPY_VERSION=3.10 -DWITH_GPU=ON
# Compile
make -j$(nproc)
```
2.5 Install PaddlePaddle
```bash
cd /paddle/build/python/dist
pip3.10 install -U the_built_wheel.whl
```
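To sanity-check the freshly installed wheel, PaddlePaddle ships a built-in self-test:

```bash
python3.10 -c "import paddle; paddle.utils.run_check()"
```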
3. Problems Encountered
3.1 Out of Memory
I initially compiled with 128 jobs (the CPU has 128 cores), but partway through the build was killed with error code 137, i.e. 128 + 9 (SIGKILL), delivered here by the kernel's OOM killer when memory ran out:
```
[ 10%] Completed 'extern_onednn'
[ 10%] Built target extern_onednn
[ 63%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_causal_sm80.cu.o
Scanning dependencies of target onednn_cmd
[ 10%] Generating third_party/install/onednn/libdnnl.so.3
[ 10%] Built target onednn_cmd
Scanning dependencies of target onednn
[ 11%] Building C object CMakeFiles/onednn.dir/onednn_dummy.c.o
[ 11%] Linking C static library libonednn.a
[ 11%] Built target onednn
[ 63%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_densemask_sm80.cu.o
[ 64%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_flashmask_sm80.cu.o
[ 64%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_bf16_sm80.cu.o
[ 65%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_densemask_sm80.cu.o
[ 65%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_flashmask_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_causal_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_densemask_sm80.cu.o
[ 66%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_flashmask_sm80.cu.o
[ 67%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim160_fp16_sm80.cu.o
[ 67%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_densemask_sm80.cu.o
[ 68%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_flashmask_sm80.cu.o
[ 68%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_causal_sm80.cu.o
[ 69%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_densemask_sm80.cu.o
[ 69%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_flashmask_sm80.cu.o
[ 70%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_bf16_sm80.cu.o
[ 70%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_fp16_causal_densemask_sm80.cu.o
[ 71%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim192_fp16_causal_flashmask_sm80.cu.o
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:394: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_densemask_sm80.cu.o] Error 137
make[5]: *** Waiting for unfinished jobs....
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:407: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_flashmask_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:1343: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim64_fp16_causal_flashmask_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:459: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu.o] Error 137
Killed
make[5]: *** [CMakeFiles/flashattn.dir/build.make:420: CMakeFiles/flashattn.dir/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu.o] Error 137
```
Fix: lower the number of jobs. With 503.5 GiB of RAM, 128 parallel jobs leave less than 4 GiB each, which the large flash-attention CUDA translation units can exceed. I dropped to 96 jobs (`make -j96`) and the build succeeded:
```
[ 77%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_causal_sm80.cu.o
[ 78%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_densemask_sm80.cu.o
[ 78%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_flashmask_sm80.cu.o
[ 79%] Building CUDA object CMakeFiles/flashattn.dir/flash_attn/src/flash_fwd_hdim96_fp16_sm80.cu.o
[ 79%] Linking CUDA shared library libflashattn.so
[100%] Built target flashattn
```
3.2 nvidia-container-cli Mount Error

Starting the GPU container also failed once with a device-cgroup error from the NVIDIA runtime hook:

```
# docker run --gpus all -it registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
Digest: sha256:996c18b0958a4e655ae5e9d3d8f35c91971bba7ab55024c85ff60d65242ecefa
Status: Downloaded newer image for registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
```
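The `bpf_prog_query(BPF_CGROUP_DEVICE) ... operation not permitted` message is characteristic of the NVIDIA runtime lacking the privileges to manage device cgroups, e.g. under rootless mode; disabling rootless mode as in section 1.1 resolves it. If you must stay rootless, NVIDIA's docs describe disabling cgroup handling in the container CLI instead. A minimal sketch, assuming the stock config file layout:

```bash
# The default /etc/nvidia-container-runtime/config.toml ships the key commented
# out as "#no-cgroups = false"; flip it to true so nvidia-container-cli stops
# trying to edit device cgroups it has no permission for.
sudo sed -i 's/^#\?no-cgroups = .*/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
# Restart the per-user (rootless) docker daemon
systemctl --user restart docker
```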
References