Decentralized Computing with Bacalhau
Run a compute node with Bacalhau and IPFS
Table of Contents
- What’s this?
- Step 1 - Run IPFS Kubo
- Step 2 - Check your IPFS node
- Step 3 - NVIDIA support for Docker containers
- Step 4 (Optional) - Troubleshoot NVIDIA Docker support
- Step 5 - Start Bacalhau
- Step 6 - Verify
What’s this?
Decentralized computing with Bacalhau and IPFS.
Steps to set up your own Linux node with an NVIDIA GPU to accept compute jobs from the Bacalhau public network.
Step 1 - Run IPFS Kubo
https://docs.bacalhau.org/running-node/quick-start-docker
Create Docker Network
docker network create --driver bridge bacalhau-network
Start an IPFS Kubo node in a Docker container and connect it to the bacalhau-network I just created:
docker run \
-d --rm --name ipfs_kubo \
--network bacalhau-network \
-v ~/ipfs-testing/export:/export -v ~/ipfs-testing/data:/data/ipfs \
-p 4001:4001 -p 4001:4001/udp -p 127.0.0.1:8080:8080 -p 127.0.0.1:5001:5001 \
ipfs/kubo:latest
Step 2 - Check your IPFS node
Verify the daemon is running with no errors:
docker logs ipfs_kubo
Make sure you don’t have a message similar to this (you probably will):
failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
If you do have that error, you can view your current UDP buffer sizes with:
$ sysctl net.core.rmem_max
net.core.rmem_max = 212992
$ sysctl net.core.wmem_max
net.core.wmem_max = 212992
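Note the units: the quic-go warning talks in KiB while sysctl reports bytes. A small read-only check (my own sketch; no sudo needed, assumes a Linux /proc filesystem) makes the comparison explicit:

```shell
# quic-go wants at least 2048 KiB (= 2097152 bytes) per buffer.
want=$((2048 * 1024))
for f in /proc/sys/net/core/rmem_max /proc/sys/net/core/wmem_max; do
  cur=$(cat "$f")
  if [ "$cur" -lt "$want" ]; then
    echo "$f = $cur bytes (below $want, consider raising it)"
  else
    echo "$f = $cur bytes (ok)"
  fi
done
```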
And increase them with the following commands. Take into account that these values might get reset to their defaults after a restart.
sudo sysctl -w net.core.rmem_max=2500000
sudo sysctl -w net.core.wmem_max=2500000
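To make the increase survive reboots, the same values can go into a sysctl drop-in file (a sketch; the file name 99-udp-buffers.conf is an arbitrary choice):

```shell
# Persist the larger buffer sizes; any *.conf name under /etc/sysctl.d works.
printf 'net.core.rmem_max=2500000\nnet.core.wmem_max=2500000\n' | \
  sudo tee /etc/sysctl.d/99-udp-buffers.conf

# Apply all sysctl configuration files without rebooting.
sudo sysctl --system
```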
Then stop your node and run it again to confirm the error is gone.
docker stop ipfs_kubo
Step 3 - NVIDIA support for Docker containers
NVIDIA no longer maintains the nvidia-docker2 package, so it’s no longer recommended to use it. The new recommended way to run CUDA containers is with Docker using the NVIDIA Container Toolkit.
Ensure you’ve set it up correctly.
I use a Tuxedo computer with Tuxedo OS, so it’s already installed for me. I can check it with apt show nvidia-container-toolkit; however, the version I have is too old, so I’ll need to update it (see the next step).
If you don’t have it installed, follow the official install steps
Once installed, you can check your NVIDIA driver details with nvidia-smi and test that it works within containers with docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi. Make sure you choose the same CUDA version you have installed; in my case it’s 12.2.0.
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... Off | 00000000:01:00.0 On | N/A |
| N/A 48C P8 18W / 125W | 1424MiB / 8192MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1456 G /usr/lib/xorg/Xorg 498MiB |
| 0 N/A N/A 2145 G ...-gnu/libexec/xdg-desktop-portal-kde 3MiB |
| 0 N/A N/A 2178 G /usr/bin/ksmserver 3MiB |
| 0 N/A N/A 2181 G /usr/bin/kded5 3MiB |
| 0 N/A N/A 2182 G /usr/bin/kwin_x11 148MiB |
| 0 N/A N/A 2209 G /usr/bin/plasmashell 25MiB |
| 0 N/A N/A 2265 G ...c/polkit-kde-authentication-agent-1 3MiB |
| 0 N/A N/A 2360 G ...86_64-linux-gnu/libexec/kdeconnectd 3MiB |
| 0 N/A N/A 2370 G /usr/bin/kaccess 3MiB |
| 0 N/A N/A 2377 G ...-linux-gnu/libexec/DiscoverNotifier 3MiB |
| 0 N/A N/A 2444 G ...AAAAAAAACAAAAAAAAAA= --shared-files 3MiB |
| 0 N/A N/A 2456 G /usr/bin/okular 3MiB |
| 0 N/A N/A 2516 G /usr/bin/systemsettings5 87MiB |
| 0 N/A N/A 2550 G /usr/lib/firefox/firefox 153MiB |
| 0 N/A N/A 2670 G /usr/bin/konsole 3MiB |
| 0 N/A N/A 2680 G /usr/bin/kwalletd5 3MiB |
| 0 N/A N/A 3335 G /usr/lib/thunderbird/thunderbird 181MiB |
| 0 N/A N/A 13042 G ..._64-linux-gnu/libexec/kf5/klauncher 3MiB |
| 0 N/A N/A 13045 G /usr/bin/kwalletmanager5 3MiB |
| 0 N/A N/A 13098 G ...86_64-linux-gnu/libexec/baloorunner 3MiB |
| 0 N/A N/A 13191 G ...,WinRetrieveSuggestionsOnlyOnDemand 67MiB |
| 0 N/A N/A 14067 G ...sion,SpareRendererForSitePerProcess 144MiB |
| 0 N/A N/A 28664 G /usr/bin/konsole 3MiB |
+---------------------------------------------------------------------------------------+
Step 4 (Optional) - Troubleshoot NVIDIA Docker support
In my case, I got the following error:
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
A more detailed explanation of the solution is here.
My current version:
$ nvidia-container-cli --version
version: 1.3.0
build date: 2020-09-16T12:32+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Add the NVIDIA libnvidia-container repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
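If you are installing from scratch rather than upgrading, the usual path after adding the repository is roughly the following (a sketch of the official NVIDIA install steps; double-check them for your distro):

```shell
# Install the toolkit from the repository added above.
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker (edits /etc/docker/daemon.json).
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker so the runtime change takes effect.
sudo systemctl restart docker
```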
As I already had it installed, I just upgraded:
sudo apt upgrade
I now verify my version again and see I have v1.14.2 installed instead.
$ nvidia-container-cli --version
cli-version: 1.14.2
lib-version: 1.14.2
build date: 2023-09-25T10:10+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
I now verify NVIDIA support is working within containers:
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... Off | 00000000:01:00.0 On | N/A |
| N/A 49C P8 18W / 125W | 1278MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Step 5 - Start Bacalhau
docker run \
-d --rm --name bacalhau \
--gpus all \
--net host \
--env BACALHAU_ENVIRONMENT=production \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /tmp:/tmp \
-u root \
ghcr.io/bacalhau-project/bacalhau:latest \
serve \
--ipfs-connect /dns4/localhost/tcp/5001 \
--node-type compute \
--private-internal-ipfs=false \
--peer env
Step 6 - Verify
Check that you are running the same client and server versions:
docker exec bacalhau bacalhau --api-host=localhost version
CLIENT SERVER LATEST
v1.1.1 v1.1.1 v1.1.1
Check that you are connected to the network (the last 3 lines here show that we are):
$ sudo lsof -i :1235
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
bacalhau 9663 root 3u IPv4 103929 0t0 TCP *:1235 (LISTEN)
bacalhau 9663 root 7u IPv4 103933 0t0 UDP *:1235
bacalhau 9663 root 8u IPv6 103934 0t0 TCP *:1235 (LISTEN)
bacalhau 9663 root 9u IPv6 103935 0t0 UDP *:1235
bacalhau 9663 root 12u IPv4 101126 0t0 TCP tuxedo:1235->191.115.245.35.bc.googleusercontent.com:1235 (ESTABLISHED)
bacalhau 9663 root 13u IPv4 40799 0t0 TCP tuxedo:1235->251.61.245.35.bc.googleusercontent.com:1235 (ESTABLISHED)
bacalhau 9663 root 14u IPv4 101157 0t0 TCP tuxedo:1235->239.251.245.35.bc.googleusercontent.com:1235 (ESTABLISHED)
https://dashboard.bacalhau.org/ is currently a bit faulty, so to view your current jobs it’s better to run:
docker exec bacalhau bacalhau job list
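As an end-to-end check, you can also submit a small test job from any machine with the bacalhau CLI installed (a sketch; the network’s scheduler picks the executor, so there is no guarantee the job lands on your particular node):

```shell
# Request one GPU and run nvidia-smi inside the CUDA base image.
bacalhau docker run --gpu 1 nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
```

The job ID it prints should then show up in the job list command above.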