Run a computing node with Bacalhau and IPFS

What’s this?

Decentralized computing with Bacalhau and IPFS.

Steps to set up your own Linux node with an NVIDIA GPU so it can accept compute jobs from the Bacalhau public network.

Step 1 - Run IPFS Kubo

Create Docker Network

docker network create --driver bridge bacalhau-network
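Running the command above a second time fails because the network already exists. A minimal sketch of an idempotent variant follows; `ensure_network` is a hypothetical helper name, not part of Docker, and it relies only on `docker network inspect` and `docker network create`.

```shell
#!/bin/sh
# ensure_network is a hypothetical helper (not a Docker command).
# It creates the bridge network only when it does not exist yet.
ensure_network() {
    name="$1"
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create --driver bridge "$name" >/dev/null &&
            echo "network $name created"
    fi
}
```

Calling `ensure_network bacalhau-network` then behaves the same whether or not the network was created earlier.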

Start an IPFS Kubo node in a Docker container and connect it to the bacalhau-network you just created:

docker run \
    -d --rm --name ipfs_kubo \
    --network bacalhau-network \
    -v ~/ipfs-testing/export:/export -v ~/ipfs-testing/data:/data/ipfs \
    -p 4001:4001 -p 4001:4001/udp -p -p \

Step 2 - Check your IPFS node

Verify that the daemon is running without errors:

docker logs ipfs_kubo

Make sure you don’t have a message similar to this (you probably will):

failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See for details.
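Rather than reading the logs by eye, the warning can be searched for. This is a small sketch, assuming the message text matches the one quoted above; `has_buffer_warning` is a hypothetical helper that reads log lines on standard input.

```shell
#!/bin/sh
# has_buffer_warning is a hypothetical helper: it succeeds when the
# UDP receive-buffer warning quoted above appears on stdin.
has_buffer_warning() {
    grep -q 'failed to sufficiently increase receive buffer size'
}

# Demonstration with the message from this guide:
sample='failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB).'
if printf '%s\n' "$sample" | has_buffer_warning; then
    echo "UDP buffer warning found"
fi
```

Against the running container this would be `docker logs ipfs_kubo 2>&1 | has_buffer_warning`, since `docker logs` may write the container's messages to stderr.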

If you do have that error, you can view your current UDP buffer sizes with:

$ sysctl net.core.rmem_max
  net.core.rmem_max = 212992
$ sysctl net.core.wmem_max
  net.core.wmem_max = 212992

Increase them with the following commands. Keep in mind that values set this way are not persistent and are reset to their defaults after a reboot.

sudo sysctl -w net.core.rmem_max=2500000
sudo sysctl -w net.core.wmem_max=2500000
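To keep the larger buffers across reboots, the same settings can be written to a sysctl configuration file. The sketch below only prints the settings; `udp_buffer_conf` is a hypothetical helper, and the file name `99-udp-buffers.conf` is an assumption (most distributions load any `*.conf` file under `/etc/sysctl.d`).

```shell
#!/bin/sh
# udp_buffer_conf is a hypothetical helper that prints sysctl settings
# for the given buffer size, one per line.
udp_buffer_conf() {
    size="$1"
    printf 'net.core.rmem_max=%s\n' "$size"
    printf 'net.core.wmem_max=%s\n' "$size"
}

# Preview the settings with the value used in this guide:
udp_buffer_conf 2500000

# To persist them (the file name is an assumption):
#   udp_buffer_conf 2500000 | sudo tee /etc/sysctl.d/99-udp-buffers.conf
#   sudo sysctl --system
```

`sysctl --system` reloads all system configuration files, so the new values apply without a reboot.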

Then stop your node and run it again to confirm the error has gone.

docker stop 

Step 3 - NVIDIA support for Docker containers

NVIDIA no longer maintains the nvidia-docker2 package, so it’s no longer recommended to use it. The new recommended way to run CUDA containers is with Docker using the NVIDIA Container Toolkit.

Ensure you’ve set it up correctly.

I use a Tuxedo computer with Tuxedo OS, so it’s already installed for me. I can check it with apt show nvidia-container-toolkit. However, the version I have is too old, so I’ll need to update it (see the next step).

If you don’t have it installed, follow the official install steps.

Once installed, you can check your NVIDIA driver details with nvidia-smi, and test that it works within containers with docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi. Make sure you choose an image tag that matches the CUDA version you have installed. In my case it’s 12.2.0.

$ nvidia-smi                                                                                  
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P8              18W / 125W |   1424MiB /  8192MiB |      2%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A      1456      G   /usr/lib/xorg/Xorg                          498MiB |
|    0   N/A  N/A      2145      G   ...-gnu/libexec/xdg-desktop-portal-kde        3MiB |
|    0   N/A  N/A      2178      G   /usr/bin/ksmserver                            3MiB |
|    0   N/A  N/A      2181      G   /usr/bin/kded5                                3MiB |
|    0   N/A  N/A      2182      G   /usr/bin/kwin_x11                           148MiB |
|    0   N/A  N/A      2209      G   /usr/bin/plasmashell                         25MiB |
|    0   N/A  N/A      2265      G   ...c/polkit-kde-authentication-agent-1        3MiB |
|    0   N/A  N/A      2360      G   ...86_64-linux-gnu/libexec/kdeconnectd        3MiB |
|    0   N/A  N/A      2370      G   /usr/bin/kaccess                              3MiB |
|    0   N/A  N/A      2377      G   ...-linux-gnu/libexec/DiscoverNotifier        3MiB |
|    0   N/A  N/A      2444      G   ...AAAAAAAACAAAAAAAAAA= --shared-files        3MiB |
|    0   N/A  N/A      2456      G   /usr/bin/okular                               3MiB |
|    0   N/A  N/A      2516      G   /usr/bin/systemsettings5                     87MiB |
|    0   N/A  N/A      2550      G   /usr/lib/firefox/firefox                    153MiB |
|    0   N/A  N/A      2670      G   /usr/bin/konsole                              3MiB |
|    0   N/A  N/A      2680      G   /usr/bin/kwalletd5                            3MiB |
|    0   N/A  N/A      3335      G   /usr/lib/thunderbird/thunderbird            181MiB |
|    0   N/A  N/A     13042      G   ..._64-linux-gnu/libexec/kf5/klauncher        3MiB |
|    0   N/A  N/A     13045      G   /usr/bin/kwalletmanager5                      3MiB |
|    0   N/A  N/A     13098      G   ...86_64-linux-gnu/libexec/baloorunner        3MiB |
|    0   N/A  N/A     13191      G   ...,WinRetrieveSuggestionsOnlyOnDemand       67MiB |
|    0   N/A  N/A     14067      G   ...sion,SpareRendererForSitePerProcess      144MiB |
|    0   N/A  N/A     28664      G   /usr/bin/konsole                              3MiB |
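The CUDA version needed to pick the image tag sits in the header line of the output above. As a small sketch, it can be extracted with `sed`; `cuda_version` is a hypothetical helper that reads `nvidia-smi` output on standard input and assumes the header keeps the `CUDA Version: X.Y` form shown here.

```shell
#!/bin/sh
# cuda_version is a hypothetical helper: it reads nvidia-smi output on
# stdin and prints the CUDA version found in the header line.
cuda_version() {
    sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Demonstration with the header line from the output above:
echo '| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2     |' | cuda_version
```

On a real machine this would be `nvidia-smi | cuda_version`. Image tags carry a patch number as well (12.2.0 in this guide), so the exact tag still has to be looked up in the image registry.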

Step 4 (Optional) - Troubleshoot nvidia docker support

In my case, I got the following error:

$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

More detailed explanation of the solution is here.

My current version:

$ nvidia-container-cli --version

version: 1.3.0
build date: 2020-09-16T12:32+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Add the NVIDIA libnvidia-container repository:

curl -fsSL | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

As I already had it installed, I just upgraded:

sudo apt upgrade

I check my version again and see that v1.14.2 is now installed.

$ nvidia-container-cli --version 
cli-version: 1.14.2
lib-version: 1.14.2
build date: 2023-09-25T10:10+00:00
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
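The before and after versions can also be compared in a script. This is a sketch under two assumptions: the version line is either `version:` (as in the old output) or `cli-version:` (as in the new one), and GNU `sort -V` is available. `cli_version` and `version_ge` are hypothetical helper names.

```shell
#!/bin/sh
# cli_version is a hypothetical helper: it reads the output of
# `nvidia-container-cli --version` on stdin and prints the version number.
# It accepts both the old "version:" and the new "cli-version:" line.
cli_version() {
    sed -n 's/^\(cli-\)\{0,1\}version: *//p' | head -n 1
}

# version_ge A B succeeds when version A is greater than or equal to B.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Demonstration with the two outputs shown in this guide:
old=$(printf 'version: 1.3.0\n' | cli_version)
new=$(printf 'cli-version: 1.14.2\nlib-version: 1.14.2\n' | cli_version)
if version_ge "$new" "$old"; then
    echo "upgraded: $old -> $new"
fi
```

A plain string comparison would rank 1.3.0 above 1.14.2, which is why the sketch uses version-aware sorting.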

I now verify that NVIDIA is working in my containers:

$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi     
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   49C    P8              18W / 125W |   1278MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |

Step 5 - Start Bacalhau

docker run \
    -d --rm --name bacalhau \
    --gpus all \
    --net host \
    --env BACALHAU_ENVIRONMENT=production \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /tmp:/tmp \
    -u root \ \
    serve \
        --ipfs-connect /dns4/localhost/tcp/5001 \
        --node-type compute \
        --private-internal-ipfs=false \
        --peer env

Step 6 - Verify

Check you are running the same client and server versions:

docker exec bacalhau bacalhau --api-host=localhost version
 v1.1.1  v1.1.1  v1.1.1 
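The comparison can be automated once the output is reduced to a single line of version strings like the one above. This is only a sketch: the real command may print headers or extra columns, and `all_fields_equal` is a hypothetical helper, not part of Bacalhau.

```shell
#!/bin/sh
# all_fields_equal is a hypothetical helper: it succeeds when every
# whitespace-separated field on the first line of stdin is identical.
# Empty input counts as a failure.
all_fields_equal() {
    awk 'NR == 1 { ok = (NF > 0); for (i = 2; i <= NF; i++) if ($i != $1) ok = 0 }
         END { exit !ok }'
}

# Demonstration with the line shown above:
if echo ' v1.1.1  v1.1.1  v1.1.1 ' | all_fields_equal; then
    echo "versions match"
fi
```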

Check you are connected to the network (last 3 lines here show that we are):

$ sudo lsof -i :1235
bacalhau 9663 root    3u  IPv4 103929      0t0  TCP *:1235 (LISTEN)
bacalhau 9663 root    7u  IPv4 103933      0t0  UDP *:1235 
bacalhau 9663 root    8u  IPv6 103934      0t0  TCP *:1235 (LISTEN)
bacalhau 9663 root    9u  IPv6 103935      0t0  UDP *:1235 
bacalhau 9663 root   12u  IPv4 101126      0t0  TCP tuxedo:1235-> (ESTABLISHED)
bacalhau 9663 root   13u  IPv4  40799      0t0  TCP tuxedo:1235-> (ESTABLISHED)
bacalhau 9663 root   14u  IPv4 101157      0t0  TCP tuxedo:1235-> (ESTABLISHED)

… is currently a bit faulty, so to view your current jobs it’s better to run:

docker exec bacalhau bacalhau job list