I am running a workstation with an AMD CPU (EPYC 7H12) and an Nvidia GPU (RTX 3090) on Ubuntu 20.04. When working with TensorFlow, I repeatedly get the warning described in this related SO question:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
An answer there suggests identifying the GPU's PCI bus ID and then setting numa_node for that device to 0. In my case, the following worked:
# 1) Identify the PCI-ID of the GPU (with domain ID)
# In my case: PCI_ID="0000:81:00.0" (format: domain:bus:device.function)
lspci -D | grep NVIDIA
# 2) Write the NUMA affinity to the device's numa_node file.
echo 0 | sudo tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node"
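The two manual steps above can be combined so that the PCI ID does not have to be copied by hand. The following is a minimal sketch, not a tested fix: the function name fix_nvidia_numa and its optional directory argument are hypothetical (the argument only exists so the logic can be tried on a fake directory tree). It assumes that 0x10de is the PCI vendor ID reported for Nvidia devices and that a class value starting with 0x03 marks a display controller, which skips the GPU's audio function.

```shell
# Sketch: set numa_node to 0 for every Nvidia display device that reports -1.
# The optional first argument (hypothetical) overrides the sysfs PCI directory.
fix_nvidia_numa() {
    local root="${1:-/sys/bus/pci/devices}"
    local dev vendor class node
    for dev in "$root"/*; do
        # Skip entries without the expected sysfs attribute files.
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] && [ -r "$dev/numa_node" ] || continue
        vendor=$(cat "$dev/vendor")
        class=$(cat "$dev/class")
        # 0x10de: Nvidia vendor ID; 0x03xxxx: display controller class.
        [ "$vendor" = "0x10de" ] || continue
        case "$class" in
            0x03*) ;;
            *) continue ;;
        esac
        node=$(cat "$dev/numa_node")
        if [ "$node" = "-1" ]; then
            echo 0 > "$dev/numa_node"
            echo "$(basename "$dev"): numa_node -1 -> 0"
        else
            echo "$(basename "$dev"): numa_node already $node"
        fi
    done
}
```

Writing to the real sysfs files requires root, so the function would have to be defined and called from a root shell (calling it without an argument targets /sys/bus/pci/devices). Like the manual command, it does not survive a reboot.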
This is, however, only a shallow fix. Firstly, the numa_node setting is reset (to the value -1) every time the system is restarted. Secondly, the Nvidia drivers seem to ignore this flag, as nvidia-smi (Nvidia's driver management tool) still displays:
nvidia-smi topo -m
#
#         GPU0    CPU Affinity    NUMA Affinity
# GPU0     X      0-127           N/A
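To compare this with what the kernel itself reports, independently of nvidia-smi, the device's sysfs attributes can be read directly. This is only a rough sketch: the function name show_numa_view and its optional second argument (an alternative sysfs root for dry runs) are hypothetical. It relies on the numa_node and local_cpulist files in the device directory and on /sys/devices/system/node/online, and it assumes the latter is present, which may not hold on kernels built without NUMA support.

```shell
# Sketch: print the kernel's view of NUMA placement for one PCI device.
# $1: PCI ID with domain, e.g. 0000:81:00.0
# $2: optional sysfs root (hypothetical override, defaults to /sys)
show_numa_view() {
    local pci_id="$1"
    local sysfs="${2:-/sys}"
    local dev="$sysfs/bus/pci/devices/$pci_id"
    if [ ! -r "$dev/numa_node" ]; then
        echo "no numa_node file for $pci_id" >&2
        return 1
    fi
    echo "numa_node:     $(cat "$dev/numa_node")"
    # CPUs the kernel considers local to the device.
    if [ -r "$dev/local_cpulist" ]; then
        echo "local_cpulist: $(cat "$dev/local_cpulist")"
    fi
    # NUMA nodes the kernel has online; a single-node system lists only 0.
    if [ -r "$sysfs/devices/system/node/online" ]; then
        echo "online nodes:  $(cat "$sysfs/devices/system/node/online")"
    fi
    return 0
}
```

Calling show_numa_view with the PCI ID from step 1 before and after the echo shows whether the kernel-side value changed, even while nvidia-smi keeps reporting N/A.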
How can I specify the NUMA affinity for the GPU persistently? Is this configured in the Nvidia driver, in Ubuntu, or in the BIOS? I know that the Linux kernel is NUMA-aware, but I have struggled to find resources on how to configure this.
Update: I've added an entry to root's crontab that reapplies the setting at every boot, which makes the fix persistent. It remains "shallow", however, as the Nvidia driver is still unaware of it.
sudo crontab -e
# Add the following line
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")