CloudRift, a GPU cloud provider for developers, was the first to sound the alarm. It reported that after just a few days of VM duty, RTX 5090 and RTX Pro 6000 cards became completely unresponsive. Once that happens, neither the host nor the guest can access the GPU again without a full reboot, which is a nightmare when you’re juggling thousands of guest machines.
The bug appears to be linked to the VFIO device driver and Function Level Reset (FLR). After an FLR, the GPU refuses to come back to life, triggering a kernel soft lockup and dragging both host and guest into deadlock territory.
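For context, an FLR is the per-function PCIe reset the host issues when a guest releases a passed-through device, and on Linux it can also be requested by hand through sysfs. The sketch below is purely illustrative of that interface, not CloudRift's tooling: the PCI address is a placeholder, it needs root, and the reset_method attribute only exists on newer kernels.

```python
from pathlib import Path

BDF = "0000:41:00.0"  # hypothetical PCI address of a passed-through GPU
dev = Path("/sys/bus/pci/devices") / BDF

# Newer kernels list the reset methods the device supports (e.g. "flr pm bus").
method_file = dev / "reset_method"
if method_file.exists():
    print("Supported reset methods:", method_file.read_text().strip())

# Writing "1" to the reset attribute asks the kernel to reset the function,
# much like VFIO does when it tears down a guest. On the affected RTX 5090 /
# RTX Pro 6000 cards, the reports say the GPU never returns from this reset
# and the kernel soft-locks.
(dev / "reset").write_text("1")
print("Reset issued; a healthy device re-enumerates shortly afterwards.")
```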
Curiously, other GPUs, such as the RTX 4090, Hopper H100 and even the data-center Blackwell B200, are unaffected. That points to the issue being specific to the newer GeForce and workstation Blackwell designs.
CloudRift is not the only one reporting the mess. A Proxmox user reported a full host crash after shutting down a Windows guest, which looks suspiciously similar. According to that user, Nvidia has admitted it can reproduce the fault and is working on a fix.
In the meantime, CloudRift has slapped a $1,000 (€930) bug bounty on the table for anyone who can patch or work around the problem. Nvidia will have to sort this quickly, because the crashes are hitting workloads that the company has been shouting about as Blackwell’s sweet spot: AI training and inference at scale.