AI & Science · BlueskyNews

AI Invented a Disease. Scientists Want to Know What Else It Fabricates.

AI systems are making genuine scientific discoveries and fabricating plausible-sounding ones with equal fluency — and biology cannot tell them apart yet.

15 records · 4 web citations

The Same Mechanism Discovers and Invents

The bixonimania experiment did not reveal a bug in AI systems; it revealed a feature that the field has not yet learned to manage. Researchers invented a fictional disease, asked popular chatbots about it, and received confident, detailed, medically plausible responses about a condition that has never existed. The experiment, published in Nature, is unsettling precisely because it runs in parallel with genuine progress: deep learning systems identifying aging biomarkers, grant-funded searches for Alzheimer's genetic targets, and an entire ICLR workshop track dedicated to AI drug discovery. The tools producing breakthroughs and the tools producing fictional diseases are the same tools. There is no architectural switch that enables one and blocks the other.

Fabrication as Discovery's Structural Twin

The biosecurity literature has started naming what the broader scientific community is still treating as a hypothetical. A Frontiers in Microbiology review on protein design risks places generative AI squarely inside the biosecurity conversation: the same generative capability that accelerates therapeutic protein design also makes novel harmful protein design more accessible. Nature's examination of AI writing full synthetic genomes frames the same structural reality from the genomic angle — each advance toward AI-assisted life synthesis is simultaneously an advance toward capabilities that biosecurity frameworks have not caught up with. These are not separate conversations that happen to run at the same time. They are the same conversation, and the scientific community has been slow to treat them as one.

The Evaluation Problem Has No Agreed Solution

The field does not yet have a shared protocol for verifying AI-generated scientific claims at scale, and the evaluation tools being built to address that gap have their own failure modes. Research finding that simple lexical constraints cause performance to collapse by 14 to 48 percent in instruction-tuned models shows that benchmark environments are sensitive to surface-level perturbations that bear no relationship to the underlying capability being measured. A model that breaks under minor rephrasing is not a reliable instrument for scientific evaluation. The National Academies convened experts specifically to examine AI security risks and define research priorities, which is institutional acknowledgment that the evaluation problem is real. But convening experts is not the same as producing a methodology. The bixonimania experiment is a clean demonstration of why methodology matters: a model that confirms a fake disease with clinical fluency cannot be caught by the model itself, by the institution deploying it, or by the patient receiving the output.

What the Literature Absorbs Before Anyone Counts

The practical stakes are already visible in clinical AI outputs. A study analyzing five AI chatbots found that nearly half of responses to health and medical queries contained errors or dangerous omissions. That figure comes from the consumer-facing tier, but the peer-reviewed tier faces the same structural exposure at a different resolution. Research that uses AI outputs as inputs, without verification steps that can catch confident fabrication, is publishing results that carry unquantified uncertainty. The labs running fastest on AI-assisted discovery are not making an error by moving quickly. They are making a bet that the fabrication rate in their specific workflow is low enough not to matter, and that bet has not been tested at the scale at which the literature is now absorbing AI-assisted results.

The Protocol That Doesn't Exist Yet

The scientific community's most urgent AI problem is not capability — the capability is already producing real results and real fictions simultaneously. The urgent problem is that no field-wide verification standard exists for distinguishing them, and the longer the literature absorbs AI-assisted findings without one, the harder retroactive auditing becomes. The researchers raising hallucination flags and the labs pushing discovery forward have both correctly identified real phenomena. What neither group has delivered is a protocol the other will adopt. The scientists who build that protocol first will determine what the next decade of AI-assisted research actually proves.

The story so far

AI's simultaneous capacity for genuine scientific discovery and confident fabrication has moved from theoretical concern to documented practice — the scientific community publishing results that depend on these outputs has no shared protocol for telling them apart, and the literature is absorbing that uncertainty now.

Frequently Asked

Why can't AI models tell the difference between a disease they invented and a real one?
Because they are not designed to distinguish the two — they are trained to generate fluent, contextually plausible text. A model confirming a fictional disease is doing exactly what it was built to do: produce coherent, confident responses that match the statistical patterns of medical language in its training data. There is no internal fact-check distinguishing documented conditions from plausible-sounding ones. The verification would have to come from outside the model, through citation checking, database lookup, or expert review — steps that most deployment contexts do not require.
What should researchers using AI tools for drug discovery or biomarker analysis do now?
Treat every AI-generated claim as a hypothesis, not a finding. That includes claims that confirm existing literature, since models can confidently hallucinate citations to real papers that say something different. Build verification checkpoints at the output stage, not the input stage. The bixonimania experiment shows that fluency is not a reliability signal; a wrong answer reads exactly like a right one. Independent database validation and human expert review remain the only current checks that don't share the same failure modes as the generating model.
What is the strongest argument against treating AI fabrication and AI discovery as equally serious risks?
The strongest counter is that fabrication in consumer health chatbots and fabrication in peer-reviewed AI-assisted research are categorically different failure modes — one affects individuals asking chatbots questions, the other is caught by peer review, replication, and expert scrutiny before it shapes the literature. Researchers working in controlled lab settings with domain expertise have always filtered bad outputs. The bixonimania result doesn't change that, because no credentialed scientist publishes a disease without checking whether it exists. That counter holds — until the throughput of AI-assisted research exceeds the capacity of expert review to catch errors, which is the condition the current pace of adoption is approaching.
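
The output-stage checkpoint recommended in the answers above (database lookup first, expert review always) could look roughly like the sketch below. It is an illustration under stated assumptions, not a method taken from any of the cited sources: the KNOWN_CONDITIONS set is a hypothetical stand-in for a curated disease vocabulary, the Claim type and triage function are invented names, and extracting condition names from free text is assumed to happen upstream.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an authoritative disease vocabulary.
# Assumption: a real workflow would query a curated external database.
KNOWN_CONDITIONS = frozenset({"alzheimer's disease", "psoriasis"})


@dataclass
class Claim:
    text: str                                        # AI-generated statement
    conditions: list = field(default_factory=list)   # names extracted upstream


def triage(claim, known=KNOWN_CONDITIONS):
    """Label a claim at the output stage; nothing is ever auto-accepted."""
    unrecognized = [c for c in claim.conditions if c.lower() not in known]
    if unrecognized:
        status = "flag_unverified_term"
    else:
        # A lookup hit only shows the term exists, not that the claim is true,
        # so recognized claims still go to a human expert.
        status = "needs_expert_review"
    return {"text": claim.text, "unrecognized": unrecognized, "status": status}


fake = triage(Claim("Bixonimania is a well-documented condition.", ["Bixonimania"]))
real = triage(Claim("Psoriasis is a chronic skin condition.", ["Psoriasis"]))
print(fake["status"], real["status"])
```

Neither branch returns an "accepted" status. That is deliberate: fluency and a vocabulary match are both treated as insufficient, so every claim stays a hypothesis until a person signs off.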

Methodology

This story was generated autonomously from 15 source records. An editorial model synthesizes, weights, and cites each source. No human editorial judgment was applied.
