The paper behind Kosmos.
An AI scientist that runs long, parallel research cycles to autonomously find and verify discoveries.
One run can coordinate 200 agents, write 42,000 lines of code, and scan 1,500 papers.
A shared world model stores facts, results, and plans so agents stay in sync.
Given a goal and dataset, it runs analyses and literature searches in parallel and updates that model.
It then proposes next tasks and repeats until it writes a report with traceable claims.
Experts judged 79.4% of its statements accurate and estimated that a 20-cycle run equals about 6 months of human work.
Across 7 studies, it reproduced unpublished results, added causal genetics evidence, proposed a disease-timing breakpoint method, and flagged a neuron-aging mechanism.
It needs clean, well-labeled data, can overstate interpretations, and still requires human review.
Net effect: it scales data-driven discovery with clear provenance and steady context across fields.
----
Paper – arxiv.org/abs/2511.02824
Paper Title: "Kosmos: An AI Scientist for Autonomous Discovery"
📈 Edison Scientific launched Kosmos, an autonomous AI researcher that reads literature, writes and runs code, and tests ideas.
It compresses about 6 months of human research into roughly 1 day.
Kosmos uses a structured world model as shared memory that links every agent’s findings, keeping work aligned to a single objective across tens of millions of tokens.
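The paper does not publish Kosmos's internals, so the sketch below is only a minimal Python illustration of that idea: a world-model object that parallel literature and analysis agents read from and write to each cycle. Every name in it (WorldModel, run_cycle, the stub agent functions) is an assumption, not the system's real API.

```python
# Minimal sketch of a shared world model driving repeated research cycles.
# All names and structures here are illustrative assumptions, not Kosmos's
# actual implementation.
from dataclasses import dataclass, field


@dataclass
class WorldModel:
    """Shared memory that keeps every agent aligned to one objective."""
    goal: str
    facts: list[str] = field(default_factory=list)     # literature findings
    results: list[str] = field(default_factory=list)   # data-analysis outputs
    plan: list[str] = field(default_factory=list)      # proposed next tasks


def search_literature(goal: str) -> list[str]:
    # Stand-in for the parallel literature-search agents.
    return [f"citation relevant to: {goal}"]


def analyze_data(dataset: str, goal: str) -> list[str]:
    # Stand-in for the parallel data-analysis agents that write and run code.
    return [f"result from {dataset} bearing on: {goal}"]


def run_cycle(model: WorldModel, dataset: str) -> None:
    # Agents work in parallel in the real system; here they run sequentially.
    model.facts.extend(search_literature(model.goal))
    model.results.extend(analyze_data(dataset, model.goal))
    # Re-plan from the updated shared state so the next cycle stays on goal.
    model.plan = [f"follow up on {len(model.facts)} facts, {len(model.results)} results"]


def autonomous_run(goal: str, dataset: str, cycles: int = 20) -> WorldModel:
    model = WorldModel(goal=goal)
    for _ in range(cycles):
        run_cycle(model, dataset)
    return model
```

The point this sketch tries to capture is that coordination happens through shared state, so every agent sees the same facts, results, and plan on each cycle.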
A run reads 1,500 papers, executes 42,000 lines of analysis code, and produces a fully auditable report where every claim is traceable to code or literature.
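To make the "every claim is traceable" property concrete, here is a small hypothetical record format. The paper says only that claims link back to code or literature, not how that link is stored, so Claim, code_ref, literature_ref, and the example path are assumptions.

```python
# Hypothetical claim record for an auditable report: each claim must point
# at the analysis code or the literature that supports it.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    code_ref: Optional[str] = None        # e.g. path of the analysis script (hypothetical)
    literature_ref: Optional[str] = None  # e.g. DOI or arXiv ID

    def is_traceable(self) -> bool:
        # A claim counts as auditable only if at least one source is attached.
        return self.code_ref is not None or self.literature_ref is not None


report = [
    Claim("Effect X replicates in the provided dataset",
          code_ref="analysis/cycle_07_regression.py"),   # hypothetical path
    Claim("Prior work reports a similar association",
          literature_ref="arXiv:2511.02824"),
]
assert all(claim.is_traceable() for claim in report)
```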
Evaluators found 79.4% of its conclusions accurate. It reproduced 3 prior human findings, including absolute humidity as the key factor in perovskite solar cell efficiency and cross-species neuronal connectivity rules, and it proposed 4 new leads, including evidence that SOD2 may lower cardiac fibrosis in humans.
Access is through Edison’s platform at $200/run with limited free use for academics.
There are caveats: runs can chase statistically neat but irrelevant signals, longer runs raise this risk, and teams often launch multiple runs to explore different paths.
Beta users estimated 6.14 months of equivalent effort for 20-cycle runs, and a simple model based on reading time and analysis time predicts about 4.1 months, which suggests output scales with run depth rather than hitting a fixed ceiling.
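The paper's exact calibration of that reading-plus-analysis model is not reproduced here. The back-of-the-envelope sketch below reuses the run totals quoted above (1,500 papers and 42,000 lines over a 20-cycle run) with invented per-paper and per-line rates, so every rate is an assumption chosen only to show why the estimate grows with the number of cycles.

```python
# Back-of-the-envelope version of the "reading time + analysis time" model.
# Per-cycle volumes come from the run totals above (1,500 papers and 42,000
# lines of code over a 20-cycle run); the human-effort rates are invented
# here purely for illustration.
PAPERS_PER_CYCLE = 1500 / 20      # ~75 papers read per cycle
LINES_PER_CYCLE = 42000 / 20      # ~2,100 lines of analysis code per cycle

HOURS_PER_PAPER = 0.2             # assumed: human time to read/skim one paper
HOURS_PER_LINE = 0.01             # assumed: human time to write/validate one line
WORK_HOURS_PER_MONTH = 160


def human_equivalent_months(cycles: int) -> float:
    reading = cycles * PAPERS_PER_CYCLE * HOURS_PER_PAPER
    analysis = cycles * LINES_PER_CYCLE * HOURS_PER_LINE
    return (reading + analysis) / WORK_HOURS_PER_MONTH


# Both terms grow linearly with cycles, so the human-equivalent estimate
# scales with run depth instead of plateauing.
for cycles in (5, 10, 20):
    print(cycles, round(human_equivalent_months(cycles), 1))
```

With these illustrative rates the 20-cycle figure lands in the same ballpark as the paper's ~4.1-month estimate, but the takeaway is the linear scaling with run depth, not the specific numbers.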