OpenRelay API Documentation

Hardware requirements, NVIDIA GPU passthrough, node health checks, and the full lifecycle of your provider hardware.

This page covers what a provider box needs, how GPU passthrough is set up, how to confirm a node is healthy, and how nodes behave over their lifetime. To bring your first node online, start with Providers — onboarding is one command. Everything here is handled for you by that command; it's documented so you know what's happening and how to operate the box afterwards.

Hardware requirements

Requirement	Detail
OS	Ubuntu 22.04 or 24.04 LTS, with root / `sudo`.
Network	Outbound HTTPS (443) only — the node dials out, so no inbound ports are required.
Tools	`curl` (preinstalled); the installer pulls `jq`, `tar`, `unzip`, `pciutils` itself.
CPU virtualization	Intel VT-x / AMD SVM — required to run VMs (`/dev/kvm`).
IOMMU	Intel VT-d / AMD AMD-Vi — required for GPU passthrough nodes.
GPU	NVIDIA (or AMD) for GPU capacity. CPU-only nodes are also welcome.
RAM	32 GB minimum, 64 GB+ recommended — each VM needs system RAM.
Disk	NVMe with headroom for base VM images. The health check warns below 50 GB free and fails below 20 GB.
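
A few of the requirements above can be checked by hand before you run the installer. The sketch below is not the installer's or the doctor's own logic; it only mirrors the disk thresholds from the table (warn below 50 GB, fail below 20 GB) and looks for the CPU virtualization flags and `/dev/kvm`. The function names are illustrative, and `df -BG` counts in GiB, so treat the result as approximate.

```shell
#!/usr/bin/env bash
# Preflight sketch: mirrors the table's thresholds, not the installer's own checks.

# Classify free disk space (whole GB): FAIL below 20, WARN below 50, else PASS.
classify_free_disk_gb() {
  local gb=$1
  if [ "$gb" -lt 20 ]; then
    echo FAIL
  elif [ "$gb" -lt 50 ]; then
    echo WARN
  else
    echo PASS
  fi
}

# Free space, in whole GiB, on the filesystem that holds the given path (GNU df).
free_disk_gb() {
  df -BG --output=avail "$1" | tail -n 1 | tr -dc '0-9'
}

# Succeeds if the CPU advertises Intel VT-x (vmx) or AMD SVM (svm).
# The file argument defaults to /proc/cpuinfo.
has_virt_flags() {
  grep -Eqw 'vmx|svm' "${1:-/proc/cpuinfo}"
}

# Print one line per check for the filesystem holding the given path (default /).
preflight() {
  local path=${1:-/}
  echo "disk ($path): $(classify_free_disk_gb "$(free_disk_gb "$path")")"
  if has_virt_flags; then echo "cpu virtualization: PASS"; else echo "cpu virtualization: FAIL"; fi
  if [ -e /dev/kvm ]; then echo "/dev/kvm: PASS"; else echo "/dev/kvm: FAIL"; fi
}
```

Source the file and run `preflight <path>` with the filesystem you expect VM images to live on. Once the node is installed, `vectorlay-node-doctor` is the authoritative check.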

Two ways to serve GPUs

A GPU node runs customer VMs with direct GPU passthrough (QEMU + VFIO) by default. A container-GPU mode (gVisor + nvproxy) is also available — pass gpu=1 to the bootstrap. See Providers → Options.

NVIDIA GPU passthrough

The one-command install configures passthrough for you — you don't hand-edit GRUB or modprobe files. On a GPU box it:

Detects your NVIDIA/AMD GPUs with lspci (works even with no driver installed).
If a host NVIDIA driver stack is preinstalled (common on datacenter images — driver + fabricmanager + DCGM, e.g. HGX H100 SXM boxes), the installer converts the box automatically: it stops and disables the NVIDIA host services, blacklists the driver stack, and unbinds the GPUs. When IOMMU is already active this happens live with no reboot; otherwise the standard reboot + re-run finishes it. (Keep the host driver deliberately with NODE_KEEP_HOST_DRIVER=1 — the node then sells no VM GPUs.)
If the GPUs aren't passthrough-ready, enables IOMMU on the kernel command line, binds each GPU (and its IOMMU-group siblings) to vfio-pci, and installs a boot-time service (vectorlay-vfio-bind.service) so the binding survives reboots.
Asks you to reboot and re-run the exact same command. The node is intentionally not enrolled until passthrough actually works — we never advertise a GPU that can't run a VM.

If, after a reboot, the GPUs still aren't bound, enable virtualization + IOMMU in your BIOS/UEFI, then reboot and re-run:

Intel — enable VT-x and VT-d.
AMD — enable SVM and IOMMU (set to Enabled, not Auto).
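
To see where a box stands before or after changing BIOS settings, you can read the same state from sysfs. This is a minimal sketch using standard Linux sysfs paths, not the installer's own detection. The function names and the `SYSFS` override are illustrative; the override exists only so the functions can be pointed at a test directory.

```shell
#!/usr/bin/env bash
# Sketch: inspect IOMMU state and GPU driver binding through sysfs.
SYSFS=${SYSFS:-/sys}

# IOMMU is active when the kernel has populated at least one IOMMU group.
iommu_active() {
  local g
  for g in "$SYSFS"/kernel/iommu_groups/*; do
    [ -d "$g" ] && return 0
  done
  return 1
}

# Print the driver a PCI device is bound to, or "none" if it is unbound.
pci_driver() {
  local link="$SYSFS/bus/pci/devices/$1/driver"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo none
  fi
}

# Print "<pci-address> <driver>" for every NVIDIA (vendor 0x10de) PCI device.
# This includes non-GPU NVIDIA functions such as HDMI audio.
nvidia_bindings() {
  local dev addr
  for dev in "$SYSFS"/bus/pci/devices/*; do
    [ -r "$dev/vendor" ] || continue
    [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
    addr=$(basename "$dev")
    echo "$addr $(pci_driver "$addr")"
  done
}
```

If `iommu_active` fails after a reboot, the BIOS/UEFI setting is the likely culprit. If it succeeds but `nvidia_bindings` shows a driver other than `vfio-pci`, re-running the install command is the documented fix.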

A passthrough GPU leaves the local console

Once a GPU is bound for passthrough it no longer drives a local display. Manage the box over SSH, serial, or IPMI rather than a monitor. Re-run the command promptly after rebooting: the provisioning token (vtk_…) is valid for one hour, so the re-run has to happen inside that window.

Multi-GPU nodes

The unit of allocation is the IOMMU group: GPUs that share a group are always passed to a VM together. The installer binds every NVIDIA/AMD GPU on the host to vfio-pci. For full-node (all-GPU) allocations, NVSwitches are passed through too so NVLink works — this is automatic.

On SXM/NVLink platforms (HGX H100/B200) the NVSwitch fabric devices are bound to vfio-pci alongside the GPUs and recorded at enrollment. A VM that rents the whole node gets the NVSwitches passed through and NVLink enabled in-guest (fabricmanager is set up automatically); partial allocations run with NVLink disabled — the fabric is shared hardware and is never split between tenants.
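
Because allocation follows IOMMU groups, it can help to see which devices share a group with a given GPU. The sketch below reads the standard sysfs `iommu_group` link for a PCI address. The function names and the `SYSFS` override are illustrative, and the way the control plane records groups at enrollment is not shown here.

```shell
#!/usr/bin/env bash
# Sketch: list the IOMMU-group siblings of a PCI device via sysfs.

# Print the IOMMU group number of a PCI device (for example 0000:01:00.0).
iommu_group_id() {
  local sysfs=${SYSFS:-/sys}
  basename "$(readlink "$sysfs/bus/pci/devices/$1/iommu_group")"
}

# Print every PCI address in the same IOMMU group as the given device,
# including the device itself. Prints nothing if the device has no group.
iommu_group_members() {
  local sysfs=${SYSFS:-/sys} dev
  for dev in "$sysfs/bus/pci/devices/$1/iommu_group/devices"/*; do
    [ -e "$dev" ] && basename "$dev"
  done
  return 0
}
```

For example, `iommu_group_members 0000:01:00.0` on a typical single-GPU box would be expected to list the GPU and its audio function, which is why the installer binds siblings along with the GPU.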

GPU runtime modes

Passthrough (QEMU + VFIO) — the default for GPU boxes with no host driver. The VM gets bare-metal GPU performance.
Container GPU (gVisor + nvproxy) — pass gpu=1 to the bootstrap. GPUs stay bound to the host NVIDIA driver and are shared via nvproxy; no IOMMU/VFIO needed.

Check node health

After install — or any time — run the built-in doctor:

sudo vectorlay-node-doctor

It prints PASS / FAIL per check and a final verdict:

Agents & tunnel — nebula, nomad, and node-agent running; auto-update timer enabled.
Virtualization — /dev/kvm present.
GPU passthrough — IOMMU active and per-IOMMU-group VFIO readiness, ending with N/M GPU unit(s) ready to rent.
Disk — enough free space for VM base images.

A green doctor means the control plane will advertise the node's GPU units.

If a unit shows WEDGED, a GPU fell off the bus (the D3cold reset bug) — power-cycle the host (a soft reboot may not clear it).
If a unit isn't bound to vfio-pci, run sudo systemctl restart vectorlay-vfio-bind or re-run the bootstrap.
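
When the doctor reports a failure in the agents check, it can be quicker to look at the three services directly. This is a small sketch built on `systemctl is-active`, not the doctor's implementation. The `check_units` name and the `SYSTEMCTL` override (there so the loop can be tested with a stub) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report the systemd state of the agents covered by the doctor's first check.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Print "<unit>: <state>" for each unit; return non-zero if any is not active.
check_units() {
  local unit state failed=0
  for unit in "$@"; do
    # is-active exits non-zero for inactive or failed units, so ignore its status
    # and rely on the printed state instead.
    state=$("$SYSTEMCTL" is-active "$unit" 2>/dev/null) || true
    echo "$unit: ${state:-unknown}"
    [ "$state" = "active" ] || failed=1
  done
  return "$failed"
}
```

Run it as `check_units nebula nomad node-agent`. For any unit that is not active, `journalctl -u <unit>` is the usual next step.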

Node lifecycle

Two credentials (don't conflate them)

	Provisioning token (`vtk_…`)	Durable identity
What	One-time bootstrap key	Data-plane mTLS cert + durable node token
Lifetime	1 hour, single-use	Lives as long as the node
On disk?	No — passed to the installer, used, discarded	Yes — under `/etc/vectorlay/`
Used for	Enrollment only	Every gateway tunnel + node-agent self-update

Everything after enrollment runs on the durable identity — the provisioning token expiring an hour later is by design and harmless.

Reboots & power loss

systemd brings nebula, nomad, and node-agent back at boot; they read their credentials from disk and rejoin automatically. On GPU nodes the VFIO bind service re-binds the GPUs before Nomad starts. No action needed.

Automatic updates

The node-agent keeps itself current via a jittered systemd timer (roughly hourly, plus up to 20 minutes of jitter), authenticated with the durable node token. Updates drain in-flight HTTP/SSH first, install atomically, and roll back on failure, so live sessions survive.

Opt out per node: enroll with NODE_AGENT_AUTOUPDATE=0, or sudo systemctl disable --now node-agent-update.timer.
Force a check now: sudo systemctl start node-agent-update.service (logs: journalctl -u node-agent-update).

Recycle / re-enroll

Re-running the onboarding one-liner on an existing box is safe and idempotent: the node keeps its identity (same node ID and overlay IP), its token is rotated, no quota is consumed, and nothing referencing the node (clusters / VMs) breaks. Use it to repair a node or roll it onto new config.

Take a node offline

To temporarily remove a node from the network, stop its agents:

sudo systemctl stop nomad nebula node-agent

Nomad stops scheduling to it and the data-plane tunnel closes (in-flight work drains first). Start the services again — or reboot — to bring it back.

Decommission a node

To retire a node permanently, open the Nodes view in the dashboard, click Decommission on the node's row, and confirm. That is the whole flow, and it needs no credentials beyond your dashboard login.

Both terminal paths below need an org API key (vl_…) with the clusters:write scope, created on the Account → API Keys page. The provisioning token you installed the node with (vtk_…) is a different credential and is rejected here.
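
Since the two credentials are easy to mix up, a script can check the prefix before it calls anything. The sketch below relies only on the `vtk_` and `vl_` prefixes mentioned on this page. The function name and the labels it prints are illustrative, and a matching prefix says nothing about whether the key is valid or carries the right scope.

```shell
#!/usr/bin/env bash
# Sketch: tell a provisioning token from an org API key by its prefix.
credential_kind() {
  case "$1" in
    vtk_*) echo provisioning ;;   # one-time bootstrap token, rejected for decommissioning
    vl_*)  echo org-api-key ;;    # org API key, used by the CLI and the API
    *)     echo unknown ;;
  esac
}
```

A wrapper could refuse to continue unless `credential_kind "$VL_KEY"` prints `org-api-key`, which catches the common mistake of pasting the `vtk_…` token.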

With the CLI installed, log in once and every later command reuses the key. A key is bound to one org, so orl reads your org from it and you never pass an org id:

orl auth login    # one time, paste your vl_… key
orl provider node update $NODE_ID --status removed

Calling the API directly is the only path that needs your org id in the URL (shown under your org name in the dashboard sidebar, with a copy button):

curl -sS -X PATCH -H "Authorization: Bearer $VL_KEY" \
  -H "Content-Type: application/json" -d '{"status":"removed"}' \
  https://api.openrelay.inc/v1/orgs/$VL_ORG/provider/nodes/$NODE_ID

Reading curl failures

Use -sS, not -fsSL. With -f, curl discards the response body on an HTTP error and reports only a generic failure (exit status 22), so an ordinary 401 (wrong credential) arrives with no JSON error to act on. -sS keeps the body, so you see the JSON error and its code.
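
In a script you usually want both the JSON body and the HTTP status. One way to get them, sketched below, is to have curl append the status on its own line with `-w` and split it off afterwards. The request is the one shown above; the helper names are illustrative, and treating any 2xx status as success is an assumption about the API.

```shell
#!/usr/bin/env bash
# Sketch: keep the JSON error body and the HTTP status when decommissioning via the API.

# Split curl output produced with -w '\n%{http_code}' into status and body.
http_status() { printf '%s\n' "$1" | tail -n 1; }
http_body()   { printf '%s\n' "$1" | sed '$d'; }

# Expects VL_KEY, VL_ORG and NODE_ID in the environment.
decommission_node() {
  local resp status
  resp=$(curl -sS -X PATCH -H "Authorization: Bearer $VL_KEY" \
    -H "Content-Type: application/json" -d '{"status":"removed"}' \
    -w '\n%{http_code}' \
    "https://api.openrelay.inc/v1/orgs/$VL_ORG/provider/nodes/$NODE_ID") || return 1
  status=$(http_status "$resp")
  http_body "$resp"                 # always show the body, including JSON errors
  case "$status" in
    2??) return 0 ;;                # assumption: any 2xx means the update was accepted
    *)   echo "HTTP $status" >&2; return 1 ;;
  esac
}
```

The body is printed whether or not the call succeeds, so a 401 or 403 still shows the API's own error message.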

Any of the three paths (dashboard, CLI, or API) frees the node's workloads and GPU units. A decommission isn't a one-way door: re-running the onboarding one-liner on a removed node auto-recovers it.

Clean offboard (decommission the hardware)

The steps above retire the node in the control plane. Clearing the node-local state is a separate, deeper step you only need when you're decommissioning the hardware for good or moving the box to a different environment and re-onboarding it.

A node's durable identity, Nomad client state, overlay certificates, and config are written to local disk at enrollment. If a box is re-onboarded — or pointed at a different environment — without clearing this, the fresh enrollment collides with the stale identity: the node's Nomad client is rejected and it gets stuck in pending, never coming online. A clean offboard wipes that node-local state so a new enrollment provisions cleanly.

This removes the node from the fleet

Stopping the agents and clearing state terminates any VMs and workloads currently running on the node. If those must be preserved, drain or migrate them first. Run every command below as root.

Stop the VectorLay agents

sudo systemctl stop nomad nebula node-agent node-agent-update.timer

Clear Nomad client state, config, and TLS material

The stale Nomad client identity is the main thing that blocks re-onboarding.

sudo rm -rf /opt/nomad/data
sudo rm -rf /etc/nomad /etc/nomad.d
sudo rm -f  /etc/systemd/system/nomad.service.d/intro.conf

Clear the overlay (Nebula) identity

sudo rm -rf /etc/nebula

Clear the VectorLay node identity

The durable node token, data-plane certificates, and config.

sudo rm -rf /etc/vectorlay/dataplane
sudo rm -f  /etc/vectorlay/node.token /etc/vectorlay/node-agent-update.env /etc/vectorlay/node.env

(Optional) Remove per-VM overlays and seed data

Hygiene only. Base images can be left in place — they're checksum-verified and reused.

sudo rm -rf /var/lib/vectorlay/vms/* /var/lib/vectorlay/seed/*

Reload systemd

sudo systemctl daemon-reload

Which steps are essential

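Steps 1–4 (stopping the agents and clearing the Nomad, Nebula, and VectorLay node state) are required: leaving any of that stale identity or client state on disk is what causes the stuck-pending problem on re-onboarding. Step 5 (per-VM overlays and seed data) is optional hygiene, and step 6 reloads systemd so it picks up the removed drop-in. If you're doing a full OS reimage instead, the fresh image is already a clean slate, so none of these steps is needed.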
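
Before re-onboarding, you can confirm that nothing from the required steps is left on disk. The sketch below only checks the paths that the commands in this section remove; it deletes nothing. The `leftover_state` name and its optional root argument (used to point it at a test directory) are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: verify that a clean offboard left no node-local identity behind.

# Paths removed by the required steps, copied from the commands in this section.
OFFBOARD_PATHS="
/opt/nomad/data
/etc/nomad
/etc/nomad.d
/etc/systemd/system/nomad.service.d/intro.conf
/etc/nebula
/etc/vectorlay/dataplane
/etc/vectorlay/node.token
/etc/vectorlay/node-agent-update.env
/etc/vectorlay/node.env
"

# Print every listed path that still exists under the given root
# (default: the real filesystem). Returns 0 if none remain, 1 otherwise.
leftover_state() {
  local root=${1:-} p found=0
  for p in $OFFBOARD_PATHS; do
    if [ -e "$root$p" ]; then
      echo "$p"
      found=1
    fi
  done
  return "$found"
}
```

Run `sudo bash -c 'source ./check.sh; leftover_state'` (the file name is a placeholder). Any path it prints is state that could collide with a fresh enrollment.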

Re-onboarding after a clean offboard

Once the node-local state is cleared, re-onboard the box with the standard one-command installer using a fresh provisioning token — the old token is single-use and short-lived:

curl -fsSL "https://api.openrelay.inc/v1/provider/bootstrap?token=<provisioning-token>&nodeType=persistent" | sudo bash

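Add &gpu=1 to run the node in container-GPU mode (gVisor + nvproxy); passthrough is the default for GPU boxes and needs no extra parameter. See Providers for how to mint a token and the full set of bootstrap options.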
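
If you would rather not pipe the download straight into bash, a more cautious variant is to save the installer to a file, check the HTTP status, and only then run it. The sketch below uses the bootstrap URL from this page. The function names are illustrative, and it assumes the installer behaves the same when run from a file as when piped, which this page does not state.

```shell
#!/usr/bin/env bash
# Sketch: fetch the bootstrap script to a file and refuse to run an error page.

# Build the bootstrap URL; pass "1" as the second argument to append gpu=1.
bootstrap_url() {
  local url="https://api.openrelay.inc/v1/provider/bootstrap?token=$1&nodeType=persistent"
  if [ "${2:-}" = "1" ]; then
    url="$url&gpu=1"
  fi
  echo "$url"
}

# Download to a file; fail on a non-2xx status or an empty body.
fetch_bootstrap() {
  local url=$1 out=$2 status
  status=$(curl -sS -o "$out" -w '%{http_code}' "$url") || return 1
  case "$status" in
    2??) ;;
    *)
      echo "bootstrap download failed: HTTP $status" >&2
      [ -s "$out" ] && cat "$out" >&2   # show the API's error body, if any
      return 1
      ;;
  esac
  [ -s "$out" ] || { echo "bootstrap script is empty" >&2; return 1; }
}
```

A typical use would be `fetch_bootstrap "$(bootstrap_url "$TOKEN")" /tmp/bootstrap.sh && sudo bash /tmp/bootstrap.sh`, where `TOKEN` holds a fresh provisioning token.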

Troubleshooting

Symptom	Cause / fix
`403 NOT_A_PROVIDER`	The org isn't an approved provider yet.
`401`	Bad, missing, or expired API key / provisioning token.
Install prints nothing ("0 logs")	`curl -fsSL … | sudo bash` swallows HTTP errors, so an empty script runs. Re-run with `curl -sS -w 'HTTP %{http_code}\n' …` to see the real status.
`checksum mismatch` / `502 ARTIFACT`	A stale artifact — re-run; artifacts are re-fetched and verified each run.
GPUs not bound to `vfio-pci` after reboot	Enable VT-d / IOMMU in BIOS; `sudo systemctl restart vectorlay-vfio-bind`; reboot and re-run.
A GPU unit shows WEDGED	The GPU fell off the bus (D3cold reset bug) — power-cycle the host.

See Providers for onboarding and Authentication for API keys.
