
Nvidia eyes LPU stacking for Feynman inference grab

29 December 2025


Hybrid bonding, SRAM dies, and a possible CUDA headache

Nvidia wants to own inference, and word on the street is that it is lining up its Feynman GPUs to do it.

The dark satanic rumour mill has spun a hell-on-earth yarn claiming that Nvidia could integrate LPUs into its next-gen Feynman GPUs, using an IP licensing deal for Groq’s LPU tech as the entry point.

GPU expert AGF reckons the LPUs could be stacked on Feynman using TSMC’s hybrid bonding, a move aimed at stuffing more low-latency memory close to compute.

The obvious comparison is AMD’s X3D play, where extra cache gets bonded on top, except that here the “extra” looks like LPU dies packed with SRAM banks.

AGF argues that building SRAM as a monolithic block on leading-edge nodes makes little sense because SRAM scaling is limited, and it would burn up pricey wafer area for minimal gain.

Instead, the idea is a main Feynman compute die on something like A16 (1.6nm) handling tensor blocks and control logic, with separate LPU dies carrying the SRAM.

Wider hybrid-bonded links would do the joining, promising a fat interface and lower energy per bit than off-package memory, which sounds lovely on a slide.

If A16 really comes with backside power delivery, that frees up the front side for vertical SRAM connections, pushing latency down where inference actually hurts.

The neat-looking diagram doing the rounds shows TSVs, vertical SRAM connections, LPU dies as SRAM banks and a hybrid bonding interface, all stacked as if it were easy.

But it is not going to be easy.

Stacking dies on a high-density compute part drags thermals into the fight, and LPUs chasing sustained throughput can turn a clever package into a throttling experiment.

Then there is the execution model problem: LPUs lean into a fixed execution order, which rubs against the flexibility people expect from GPU workflows.

Even if the hardware works, the software may still sulk.

CUDA likes abstraction and kernels that do not care where every byte lives, while LPU-style execution treats explicit memory placement and determinism as first-class citizens.

Getting a mixed LPU-GPU environment to behave inside CUDA looks like the sort of “engineering marvel” that eats schedules, budgets and careers.
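To make that contrast concrete, here is a minimal sketch. The managed-memory half uses real CUDA runtime calls and shows how today's kernels stay blissfully ignorant of physical placement; the commented lpuPlace and lpuScheduleStatic calls are purely hypothetical placeholders for the kind of explicit placement and static scheduling an LPU-style path would seem to demand, and do not exist in any shipping API.

```cpp
// Sketch only: the managed-memory path is ordinary CUDA; the "lpu*" calls in
// comments are hypothetical stand-ins, not a real or announced API.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;   // the kernel neither knows nor cares where x physically lives
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // Today's CUDA habit: managed memory, placement abstracted away.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // The closest existing knob is a hint, not a guarantee, and nothing like
    // pinning data into a specific stacked SRAM bank.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, 0);

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    // A deterministic LPU-style flow would instead want something explicit,
    // e.g. (hypothetical, does not exist in CUDA):
    //   lpuPlace(x, n * sizeof(float), LPU_SRAM_BANK_0);   // pin into a stacked SRAM tier
    //   lpuScheduleStatic(plan);                            // fixed execution order, no dynamic scheduling

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

Bolting the second half onto the first without breaking either is exactly the software headache the rumour glosses over.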
