
I am running a workstation with an AMD CPU (EPYC 7H12) and an Nvidia GPU (RTX 3090) on Ubuntu 20.04. When working with TensorFlow, I repeatedly receive warnings like the ones described in this related SO question.

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

An answer suggests identifying the GPU's PCI bus ID and then setting the numa_node value for that device to 0. In my case, the following worked:

# 1) Identify the PCI-ID of the GPU (with domain ID)
#    In my case: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Write the NUMA affinity to the device's numa_node file.
echo 0 | sudo tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node"
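To avoid looking up the PCI-ID by hand, the two steps above can be folded into a small helper. This is only a sketch of my own: the Nvidia vendor ID 0x10de and the sysfs layout are assumptions on my part, not part of the SO answer.

```shell
# fix_gpu_numa: for every Nvidia PCI device (vendor ID 0x10de) under the
# given sysfs root that still reports numa_node=-1, write 0 into its
# numa_node file. Run as root for the real sysfs tree; the root argument
# exists mainly so the function can be tried on a scratch directory first.
fix_gpu_numa() {
  root="${1:-/sys/bus/pci/devices}"
  for dev in "$root"/*; do
    [ -f "$dev/vendor" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x10de" ] \
       && [ "$(cat "$dev/numa_node" 2>/dev/null)" = "-1" ]; then
      echo 0 > "$dev/numa_node"
    fi
  done
}
```

Calling `fix_gpu_numa` without arguments (as root) then touches the real devices.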

This is, however, only a shallow fix. First, the numa_node setting resets to -1 every time the system restarts. Second, the Nvidia driver seems to ignore the setting, as nvidia-smi (Nvidia's driver management tool) still reports:

nvidia-smi topo -m
#
#       GPU0  CPU Affinity    NUMA Affinity
# GPU0     X  0-127           N/A

How can I specify the NUMA affinity for the GPU persistently? Is this configured in the Nvidia driver, in Ubuntu, or in the BIOS? I know that the Linux kernel is NUMA-aware, but I found it difficult to find resources on how to configure this.
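For context, this is how I check which NUMA topology the kernel itself sees (on EPYC the number of nodes per socket also depends on BIOS settings):

```shell
# Show how many NUMA nodes the kernel sees and which CPUs belong to each.
# (numactl --hardware, from the numactl package, prints more detail,
# including per-node memory.)
lscpu | grep -i numa
```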


Update: I've added a crontab entry as root, which makes the fix survive reboots. It remains "shallow", however, since the Nvidia driver is still unaware of it.

sudo crontab -e
# Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
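An alternative to the crontab entry is a systemd oneshot unit, e.g. /etc/systemd/system/fix-gpu-numa.service (the unit name is my own; I have not verified that this interacts any better with the driver than the crontab does):

```ini
[Unit]
Description=Work around numa_node=-1 for the GPU
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 0 > /sys/bus/pci/devices/<PCI_ID>/numa_node'

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable fix-gpu-numa.service.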
normanius

0 Answers