Provider nodes
Hardware requirements, NVIDIA GPU passthrough, node health checks, and the full lifecycle of your provider hardware.
This page covers what a provider box needs, how GPU passthrough is set up, how to confirm a node is healthy, and how nodes behave over their lifetime. To bring your first node online, start with Providers — onboarding is one command. Everything here is handled for you by that command; it's documented so you know what's happening and how to operate the box afterwards.
Hardware requirements
| Requirement | Detail |
|---|---|
| OS | Ubuntu 22.04 or 24.04 LTS, with root / sudo. |
| Network | Outbound HTTPS (443) only — the node dials out, so no inbound ports are required. |
| Tools | curl (preinstalled); the installer pulls jq, tar, unzip, pciutils itself. |
| CPU virtualization | Intel VT-x / AMD SVM — required to run VMs (/dev/kvm). |
| IOMMU | Intel VT-d / AMD AMD-Vi — required for GPU passthrough nodes. |
| GPU | NVIDIA (or AMD) for GPU capacity. CPU-only nodes are also welcome. |
| RAM | 32 GB minimum, 64 GB+ recommended — each VM needs system RAM. |
| Disk | NVMe with headroom for base VM images. The health check warns below 50 GB free and fails below 20 GB. |
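The disk and virtualization rows above can be checked by hand before you install. The sketch below is an illustration, not the installer's or the doctor's actual code: the function names are made up for this example, and only the 50 GB / 20 GB thresholds and the /dev/kvm path come from the table.

```sh
#!/bin/sh
# Hypothetical pre-install checks; NOT the vectorlay-node-doctor implementation.

# disk_verdict FREE_GB -> PASS, WARN, or FAIL, using the thresholds from the
# table: warn below 50 GB free, fail below 20 GB free.
disk_verdict() {
    if [ "$1" -lt 20 ]; then
        echo FAIL
    elif [ "$1" -lt 50 ]; then
        echo WARN
    else
        echo PASS
    fi
}

# free_gb PATH -> whole GiB available on the filesystem that holds PATH.
# df -Pk prints 1024-byte blocks; column 4 of the second line is "Available".
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# kvm_verdict [DEVICE] -> PASS if the KVM device node exists (default /dev/kvm).
kvm_verdict() {
    if [ -e "${1:-/dev/kvm}" ]; then echo PASS; else echo FAIL; fi
}
```

For example, `disk_verdict "$(free_gb /var/lib)"` reports on the filesystem holding /var/lib; which path actually stores the VM base images is not stated on this page, so substitute the right one for your box.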
Two ways to serve GPUs
A GPU node runs customer VMs with direct GPU passthrough (QEMU + VFIO) by
default. A container-GPU mode (gVisor + nvproxy) is also available — pass
gpu=1 to the bootstrap. See Providers → Options.
NVIDIA GPU passthrough
The one-command install configures passthrough for you — you don't hand-edit GRUB or modprobe files. On a GPU box it:
- Detects your NVIDIA/AMD GPUs with lspci (works even with no driver installed).
- If they aren't passthrough-ready, enables IOMMU on the kernel command line, binds each GPU (and its IOMMU-group siblings) to vfio-pci, and installs a boot-time service (vectorlay-vfio-bind.service) so the binding survives reboots.
- Asks you to reboot and re-run the exact same command. The node is intentionally not enrolled until passthrough actually works: we never advertise a GPU that can't run a VM.
If, after a reboot, the GPUs still aren't bound, enable virtualization + IOMMU in your BIOS/UEFI, then reboot and re-run:
- Intel — enable VT-x and VT-d.
- AMD — enable SVM and IOMMU (set to Enabled, not Auto).
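To see which driver each GPU is bound to after the reboot, one option is to read the "Kernel driver in use" line that lspci -nnk prints for every device. The sketch below is a hedged illustration: the function name gpu_drivers is invented for this example, and treating every PCI class 03xx device as a GPU is an assumption, not necessarily how the installer detects them.

```sh
#!/bin/sh
# gpu_drivers: reads `lspci -nnk` output on stdin and prints one line per
# display-class device: "<pci-address> <driver>", or "none" if no driver is bound.
gpu_drivers() {
    awk '
        /^[0-9a-f]/ {                       # device header line (not indented)
            if (addr != "") print addr, driver
            addr = ""; driver = "none"
            # With -nn the class code appears as "[03xx]:"; 03xx = display controller.
            if ($0 ~ /\[03[0-9a-f][0-9a-f]\]:/) addr = $1
        }
        /Kernel driver in use:/ { driver = $NF }
        END { if (addr != "") print addr, driver }
    '
}
```

Run it as `lspci -nnk | gpu_drivers`. A passthrough-ready GPU should show vfio-pci in the second column; anything else suggests the BIOS settings above or the bind service need another look.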
A passthrough GPU leaves the local console
Once a GPU is bound for passthrough it no longer drives a local display. Manage the box over SSH, serial, or IPMI, not a monitor. Re-run the command promptly after rebooting (within about an hour) so your enrollment token is still valid.
Multi-GPU nodes
The unit of allocation is the IOMMU group: GPUs that share a group are always
passed to a VM together. The installer binds every NVIDIA/AMD GPU on the host
to vfio-pci. For full-node (all-GPU) allocations, NVSwitches are passed through
too so NVLink works — this is automatic.
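Because allocation follows IOMMU groups, it can help to see which devices share a group on your host. Linux exposes the grouping under /sys/kernel/iommu_groups; the sketch below walks that tree. The function name and the optional root argument (there only so the function can be pointed at a test directory) are assumptions made for this example.

```sh
#!/bin/sh
# list_iommu_groups [ROOT] -> one line per device: "<group> <pci-address>".
# ROOT defaults to the standard sysfs location.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing: IOMMU is likely off
        group="${dev%/devices/*}"      # strip the trailing "/devices/<address>"
        echo "${group##*/} ${dev##*/}"
    done
}
```

Empty output usually means IOMMU is not active. Two GPUs printed with the same group number would be handed to a VM together.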
GPU runtime modes
- Passthrough (QEMU + VFIO): the default for GPU boxes with no host driver. The VM gets bare-metal GPU performance.
- Container GPU (gVisor + nvproxy): pass gpu=1 to the bootstrap. GPUs stay bound to the host NVIDIA driver and are shared via nvproxy; no IOMMU/VFIO needed.
Check node health
After install — or any time — run the built-in doctor:
sudo vectorlay-node-doctor
It prints PASS / FAIL per check and a final verdict:
- Agents & tunnel: nebula, nomad, and node-agent running; auto-update timer enabled.
- Virtualization: /dev/kvm present.
- GPU passthrough: IOMMU active and per-IOMMU-group VFIO readiness, ending with N/M GPU unit(s) ready to rent.
- Disk: enough free space for VM base images.
A green doctor means the control plane will advertise the node's GPU units.
- If a unit shows WEDGED, a GPU fell off the bus (the D3cold reset bug) — power-cycle the host (a soft reboot may not clear it).
- If a unit isn't bound to vfio-pci, run sudo systemctl restart vectorlay-vfio-bind or re-run the bootstrap.
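If you want to script around the doctor, for instance in a cron job or monitoring hook, one approach is to pull the ready/total counts out of its summary line. This sketch assumes the summary is printed literally as "N/M GPU unit(s) ready to rent", as described above; the exact output format and the doctor's exit codes are not specified on this page, so treat both function names and the parsing as illustrative.

```sh
#!/bin/sh
# ready_units: reads doctor output on stdin, prints "<ready> <total>".
# Assumes a summary line of the form "N/M GPU unit(s) ready to rent".
ready_units() {
    grep -Eo '[0-9]+/[0-9]+ GPU unit\(s\) ready to rent' \
        | head -n 1 | cut -d' ' -f1 | tr '/' ' '
}

# all_units_ready: succeeds only if a summary was found and ready == total > 0.
all_units_ready() {
    set -- $(ready_units)
    [ "$#" -eq 2 ] && [ "$1" -eq "$2" ] && [ "$2" -gt 0 ]
}
```

A possible use is `sudo vectorlay-node-doctor | all_units_ready || echo "node needs attention"`.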
Node lifecycle
Two credentials (don't conflate them)
| | Provisioning token (vtk_…) | Durable identity |
|---|---|---|
| What | One-time bootstrap key | Data-plane mTLS cert + durable node token |
| Lifetime | 1 hour, single-use | Lives as long as the node |
| On disk? | No — passed to the installer, used, discarded | Yes — under /etc/vectorlay/ |
| Used for | Enrollment only | Every gateway tunnel + node-agent self-update |
Everything after enrollment runs on the durable identity — the provisioning token expiring an hour later is by design and harmless.
Reboots & power loss
systemd brings nebula, nomad, and node-agent back at boot; they read their
credentials from disk and rejoin automatically. On GPU nodes the VFIO bind service
re-binds the GPUs before Nomad starts. No action needed.
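To confirm after a reboot that the three agents really came back, you can ask systemd for each unit's state. The sketch below wraps systemctl is-active; the function name and the SYSTEMCTL override (present only so the function can be exercised without systemd) are assumptions for this example.

```sh
#!/bin/sh
# agents_state: prints "<unit> <state>" for nebula, nomad, and node-agent.
# Returns non-zero if any unit is not active.
agents_state() {
    rc=0
    for unit in nebula nomad node-agent; do
        # is-active prints the state and exits non-zero unless it is "active".
        state=$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null) || rc=1
        echo "$unit ${state:-unknown}"
    done
    return "$rc"
}
```

Calling `agents_state` on a healthy node should list all three units as active and return success.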
Automatic updates
The node-agent keeps itself current via a jittered systemd timer (~hourly + up
to 20 min), authenticated with the durable node token. Updates drain in-flight
HTTP/SSH first, install atomically, and roll back on failure — live sessions
survive.
- Opt out per node: enroll with NODE_AGENT_AUTOUPDATE=0, or run sudo systemctl disable --now node-agent-update.timer.
- Force a check now: sudo systemctl start node-agent-update.service (logs: journalctl -u node-agent-update).
Recycle / re-enroll
Re-running the onboarding one-liner on an existing box is safe and idempotent: the node keeps its identity (same node ID and overlay IP), its token is rotated, no quota is consumed, and nothing referencing the node (clusters / VMs) breaks. Use it to repair a node or roll it onto new config.
Take a node offline
To temporarily remove a node from the network, stop its agents:
sudo systemctl stop nomad nebula node-agent
Nomad stops scheduling to it and the data-plane tunnel closes (in-flight work drains first). Start the services again, or reboot, to bring it back.
Decommission a node
To retire a node permanently, set its status to removed from the Nodes view
in the dashboard, or via the API:
curl -fsSL -X PATCH -H "Authorization: Bearer $VL_KEY" \
-H "Content-Type: application/json" -d '{"status":"removed"}' \
https://api.openrelay.inc/v1/orgs/$VL_ORG/provider/nodes/$NODE_ID
This frees the node's workloads and GPU units. A decommission isn't a one-way door: re-running the onboarding one-liner on a removed node auto-recovers it.
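If you decommission nodes from a script, it may be worth guarding against an unset variable silently producing a malformed URL. The sketch below wraps the same PATCH request in two helper functions. The function names are invented here; the endpoint, headers, and body are the ones shown in the curl call, and nothing else about the API is assumed.

```sh
#!/bin/sh
# node_url ORG NODE_ID -> the node resource URL used by the PATCH request.
node_url() {
    [ -n "$1" ] && [ -n "$2" ] || {
        echo "usage: node_url ORG NODE_ID" >&2
        return 2
    }
    echo "https://api.openrelay.inc/v1/orgs/$1/provider/nodes/$2"
}

# decommission_node ORG NODE_ID -> sets the node's status to "removed".
# Requires the API key in VL_KEY; refuses to run without it.
decommission_node() {
    [ -n "${VL_KEY:-}" ] || { echo "VL_KEY is not set" >&2; return 2; }
    url=$(node_url "$1" "$2") || return 2
    curl -fsS -X PATCH \
        -H "Authorization: Bearer $VL_KEY" \
        -H "Content-Type: application/json" \
        -d '{"status":"removed"}' \
        "$url"
}
```

Usage would be `decommission_node "$VL_ORG" "$NODE_ID"`; with either argument empty, the function stops before any request is sent.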
Troubleshooting
| Symptom | Cause / fix |
|---|---|
| 403 NOT_A_PROVIDER | The org isn't an approved provider yet. |
| 401 | Bad, missing, or expired API key / provisioning token. |
| Install prints nothing ("0 logs") | curl -fsSL … \| sudo bash swallows HTTP errors, so an empty script runs. Re-run with curl -sS -w 'HTTP %{http_code}\n' … to see the real status. |
| checksum mismatch / 502 ARTIFACT | A stale artifact. Re-run; artifacts are re-fetched and verified each run. |
| GPUs not bound to vfio-pci after reboot | Enable VT-d / IOMMU in BIOS; sudo systemctl restart vectorlay-vfio-bind; reboot and re-run. |
| A GPU unit shows WEDGED | The GPU fell off the bus (D3cold reset bug) — power-cycle the host. |
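When an install or API call fails quietly, the HTTP status is usually the fastest clue. The sketch below maps the status codes from the table to their likely causes and shows one way to capture the code with curl's -w option. Both function names are hypothetical, the mapping covers only the codes listed above, and a given status can have other causes than the one printed.

```sh
#!/bin/sh
# explain_status CODE -> a hint based on the troubleshooting table.
explain_status() {
    case "$1" in
        2??) echo "ok" ;;
        401) echo "bad, missing, or expired API key / provisioning token" ;;
        403) echo "org is not an approved provider yet (NOT_A_PROVIDER)" ;;
        502) echo "stale artifact (ARTIFACT): re-run the install" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# http_status URL -> prints only the HTTP status code of a GET request.
# The response body is discarded; -sS keeps curl quiet but still shows errors.
http_status() {
    curl -sS -o /dev/null -w '%{http_code}' "$1"
}
```

With the onboarding URL from the Providers page in a variable of your choosing (URL here is a placeholder), `explain_status "$(http_status "$URL")"` prints the hint for whatever status the server returns.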
See Providers for onboarding and Authentication for API keys.