The “AI Diagnostic Orchestrator” is the first public effort from Vole's new health AI unit, cooked up by Microsoft AI chief Mustafa Suleyman [pictured] and filled with staff nicked from DeepMind, the outfit he co-founded before Google bought it.
Suleyman told the Financial Times this is a step towards “medical superintelligence” that could tackle the chronic underfunding and burnout wrecking healthcare systems around the globe.
The orchestrator works by setting up a virtual roundtable of five AI agents, each pretending to be a specialist doctor with roles like forming hypotheses or picking tests. These synthetic medics debate each case before settling on a diagnosis.
To see what it could do, the system was handed 304 diagnostic puzzles from the New England Journal of Medicine, the kind that stump clinicians. The AI then used a new technique called “chain of debate” to show its reasoning in human-readable steps.
Microsoft fed it large language models from the usual suspects, OpenAI, Meta, Anthropic, Google, xAI and DeepSeek. The orchestrator improved all of them, but OpenAI’s o3 reasoning model was the winner, solving 85.5 per cent of cases.
For comparison, experienced doctors only managed 20 per cent, although in fairness they weren’t allowed textbooks or even to consult colleagues during the trial.
Microsoft is now eyeing ways to bake this wizardry into Copilot and Bing, which handle around 50 million health queries daily.
Suleyman reckons we're close to “AI models that are not just a little bit better, but dramatically better, than human performance: faster, cheaper and four times more accurate.”
“That is going to be truly transformative,” he added.
Microsoft has sunk nearly $14 billion into OpenAI and has exclusive rights to flog its tech, though relations have been testy as the startup tries to go full for-profit.
Despite OpenAI performing best, Suleyman claimed Microsoft is “agnostic” about which LLMs it uses, insisting it’s the orchestration that sets their platform apart.
Former DeepMind health boss Dominic King, now with Vole, said the system “performed better than anything we’ve ever seen before” and could act as a new “front door to healthcare”.
King noted that the AI was programmed to be tight-fisted, slashing unnecessary tests and saving hundreds of thousands of dollars during the trial.
But he was quick to pump the brakes, stressing the tech isn’t ready for hospitals and hasn’t been peer reviewed.
Cardiologist Eric Topol called it a “landmark study”, noting this was the first serious glimpse at how generative AI might combine efficiency with accuracy and cost savings in medicine, even if it’s still miles away from actual clinical use.