TL;DR. Today we're releasing Medmarks v1.0, a major expansion of our open-source evaluation suite for medical LLMs. It now covers 30 benchmarks and 61 models across 71 configurations, spanning verifiable tasks, open-ended clinical reasoning, and agentic workflows. Gemini 3 Pro Preview tops the verifiable leaderboard; GPT-5.2 tops the open-ended one. The full paper is on arXiv. Medmarks will serve as a live leaderboard, so newer models will be added soon.
Today, we release a new version of our medical evaluation suite, including a preprint on arXiv and updated code. This v1.0 release is a significant upgrade in scale, reliability, and scope.

Figure 1: Results on Medmarks-V and Medmarks-OE for the subset of models evaluated on both benchmarks.
The rapid deployment and adoption of LLMs in clinical practice have outpaced the evidence around their capabilities, in large part because reliable and exhaustive evaluations are difficult to create. Multiple approaches have been proposed to solve this problem, but they all suffer from limitations that prevent model developers from actually using these evaluations during development. For instance, MedHELM introduced by Stanford
Our goal with Medmarks is to offer a test environment that is simultaneously (1) fully open, with no gated datasets, (2) broad enough to cover realistic clinical tasks beyond multiple choice, (3) large enough in model coverage to support meaningful cross-model comparisons, and (4) structured so the same infrastructure supports both evaluation and post-training.
Rankings use a weighted mean win rate, reported separately for the verifiable subset (Medmarks-V) and the open-ended subset (Medmarks-OE).
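As a rough illustration of how a weighted mean win rate can be computed, the sketch below assumes that a model's win rate on one benchmark is the fraction of other models it outscores (ties count as half a win), and that per-benchmark win rates are then averaged using benchmark weights. The exact definition and weights used by Medmarks are given in the paper and may differ; the function names, benchmark names, scores, and weights here are all hypothetical.

```python
from typing import Dict


def win_rate(scores: Dict[str, float], model: str) -> float:
    """Fraction of other models that `model` outscores on one benchmark.

    Assumption for this sketch: a tie counts as half a win.
    """
    others = [m for m in scores if m != model]
    if not others:
        return 0.0
    wins = 0.0
    for other in others:
        if scores[model] > scores[other]:
            wins += 1.0
        elif scores[model] == scores[other]:
            wins += 0.5
    return wins / len(others)


def weighted_mean_win_rate(
    results: Dict[str, Dict[str, float]],  # benchmark -> model -> score
    weights: Dict[str, float],             # benchmark -> weight (hypothetical)
    model: str,
) -> float:
    """Weighted average of per-benchmark win rates for one model."""
    total = 0.0
    weight_sum = 0.0
    for bench, scores in results.items():
        if model not in scores:
            continue  # skip benchmarks the model was not evaluated on
        w = weights.get(bench, 1.0)
        total += w * win_rate(scores, model)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Hypothetical scores, not Medmarks results.
results = {
    "bench_a": {"model_x": 0.90, "model_y": 0.80, "model_z": 0.70},
    "bench_b": {"model_x": 0.60, "model_y": 0.65, "model_z": 0.65},
}
weights = {"bench_a": 2.0, "bench_b": 1.0}

for name in ("model_x", "model_y", "model_z"):
    print(name, round(100 * weighted_mean_win_rate(results, weights, name), 1))
```

A win-rate aggregate of this kind compares models by rank within each benchmark, so benchmarks with very different score scales contribute on equal footing before weighting.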
It's commonly debated whether general-purpose LLMs are sufficient for medical use

Figure 2: Win-rate change between medical fine-tunes and their base models.
While the largest open models we tested, like GLM 4.7, are approaching the frontier on the predominantly multiple-choice Medmarks-V benchmark, the proprietary–open gap reappears on the harder Medmarks-OE benchmark, which contains open-ended medical questions and agentic workflows.
Furthermore, open models have another Achilles heel: efficiency. For example, GLM 4.7 and GPT-5.2 score almost identically on Medmarks-V, but GLM 4.7 needs over 5× as many tokens to get there. We also see notable differences among proprietary models: Grok 4 and Gemini 3 Pro Preview use larger reasoning token budgets than the GPT models, with Grok 4 costing roughly an order of magnitude more per query than GPT-5.1.
These findings reveal a major limitation for open-weight model deployment in clinical settings, where time and cost are of the essence.
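One way to make the accuracy-versus-tokens trade-off concrete is to ask which models are Pareto-optimal, meaning no other model achieves at least the same win rate with no more tokens while being strictly better on one of the two. The sketch below is only an illustration: `ModelPoint`, `pareto_frontier`, and all numbers are hypothetical and are not Medmarks results.

```python
from typing import List, NamedTuple


class ModelPoint(NamedTuple):
    name: str
    win_rate: float     # higher is better
    mean_tokens: float  # lower is better


def pareto_frontier(points: List[ModelPoint]) -> List[ModelPoint]:
    """Return the points that no other point dominates, sorted by token usage."""
    frontier = []
    for p in points:
        dominated = any(
            q.win_rate >= p.win_rate
            and q.mean_tokens <= p.mean_tokens
            and (q.win_rate > p.win_rate or q.mean_tokens < p.mean_tokens)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.mean_tokens)


# Hypothetical values chosen to mimic "same score, over 5x the tokens".
points = [
    ModelPoint("model_a", 0.70, 1000.0),
    ModelPoint("model_b", 0.70, 5200.0),
    ModelPoint("model_c", 0.75, 3000.0),
    ModelPoint("model_d", 0.55, 400.0),
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)
```

In this toy data, `model_b` drops off the frontier because `model_a` matches its win rate with far fewer tokens, which is the situation described above for models with similar scores but very different token usage.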

Figure 3: Mean win rate on Medmarks-V versus tokens used for the top 12 models evaluated.
OpenAI's gpt-oss models expose reasoning effort as a low/medium/high dial, which lets us test the effect of more reasoning tokens in a controlled setting. Increasing reasoning effort produces an almost Pareto improvement across datasets, with PubMedQA
These findings are notable because we also observe that models tend to "overthink" questions they ultimately answer incorrectly. Since accuracy improves with reasoning budget overall, it appears that models naturally reason longer on harder questions but still fail to answer some of them correctly.

Figure 4: Win-rate change between gpt-oss reasoning levels.
We ran three rollouts of nearly every multiple-choice benchmark: one with the original order and two with shuffled answer positions. The variance reveals that multiple-choice answer order still affects modern LLMs, including frontier ones.
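The rollout setup described above can be sketched as follows: keep one copy of each question in its original order and create two more with the answer options shuffled under fixed seeds, remapping the index of the correct answer each time. This is a minimal sketch rather than the Medmarks implementation; the function names, seeds, and sample question are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_options(
    options: Sequence[str], answer_idx: int, seed: int
) -> Tuple[List[str], int]:
    """Shuffle options with a fixed seed and return the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[k] is the original index now sitting at position k.
    return shuffled, order.index(answer_idx)


def build_rollouts(
    options: Sequence[str], answer_idx: int, shuffle_seeds: Sequence[int] = (1, 2)
) -> List[Tuple[List[str], int]]:
    """One rollout in the original order plus one per shuffle seed."""
    rollouts = [(list(options), answer_idx)]
    for seed in shuffle_seeds:
        rollouts.append(shuffle_options(options, answer_idx, seed))
    return rollouts


# Hypothetical question: the correct option is "Metformin" (index 1).
options = ["Insulin", "Metformin", "Warfarin", "Aspirin"]
rollouts = build_rollouts(options, answer_idx=1)

for opts, idx in rollouts:
    print(opts, "-> correct:", opts[idx])
```

Using fixed seeds keeps the shuffled variants reproducible, so differences in accuracy between rollouts can be attributed to option position rather than to a different random draw on each run.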
The most striking case is Grok 4 on M-ARC
The Medbullets

Figure 5: Comparing model performance with and without an extra option on the Medbullets
For v1.0 we ran MedCalc-Bench
Some models improve substantially with tools — MiniMax M2, Qwen3 VL, Mirothinker 1.5 30B, Olmo 3 7B. But many regress, and the failure modes are informative:
The takeaway: tool use isn't a free capability upgrade. The combination of tool templates, chat-completion replay semantics, and instruction-following can introduce new failure modes that offset the benefits of actually having a calculator.

Figure 6: MedCalc-Bench with and without tools.
One of our goals with Medmarks was to build infrastructure that supports both evaluation and post-training from the same codebase. The seven benchmarks with train/test splits — MedQA
For v1.0 we ran a very preliminary RL post-training demonstration on Qwen-3-4B-Instruct-0725 across three datasets with different reward formulations: MedCalc-Bench-Verified (calculation verifier), MedMCQA (multiple-choice matching), and MedCaseReasoning (LLM-as-a-Judge). All three show clear learning curves over 560 training steps on 8 H100s.
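To give a flavor of the two verifiable reward formulations mentioned above, here is a minimal sketch of a multiple-choice matching reward and a numeric calculation verifier. It is not the Medmarks code: the assumed "Answer: <letter>" output format, the 5% relative tolerance, and all function names are assumptions made for illustration. The LLM-as-a-Judge reward is omitted because it requires a model call.

```python
import math
import re

# Assumes the model is prompted to finish with "Answer: <letter>".
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-E])\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def multiple_choice_reward(completion: str, gold: str) -> float:
    """1.0 if the last 'Answer: X' letter matches the gold letter, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no parseable answer earns no reward
    return 1.0 if matches[-1].upper() == gold.upper() else 0.0


def calculation_reward(completion: str, gold: float, rel_tol: float = 0.05) -> float:
    """1.0 if the last number in the completion is within rel_tol of gold, else 0.0.

    The 5% default tolerance is an assumption for this sketch.
    """
    numbers = NUMBER_RE.findall(completion)
    if not numbers:
        return 0.0
    predicted = float(numbers[-1])
    return 1.0 if math.isclose(predicted, gold, rel_tol=rel_tol) else 0.0


print(multiple_choice_reward("The findings suggest sepsis. Answer: C", "C"))
print(calculation_reward("BMI = 24.9", 25.0))
```

Both rewards are binary and return 0.0 when no answer can be parsed, which also penalizes completions that ignore the requested output format.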

Figure 7: Test accuracy and training reward for Qwen-3-4B-Instruct-0725 trained on MedCalc-Bench-Verified, MedMCQA, and MedCaseReasoning over the course of training for 560 steps.
There are many more results in the main paper and our earlier blog post; check them out!
We hope our benchmark suite brings us closer to real-world assessment of LLM medical capabilities in a more reproducible and accessible manner. We will continue to add new models to the benchmark suite (if you’re a model developer, please get in touch with us!) to evaluate the progress of medical capabilities in LLMs.
We are also exploring medical-specific post-training to further improve the performance of open-source LLMs. Medmarks-T is just a starting point: we plan to construct additional datasets and environments and to experiment with different post-training methods. If you are interested in such research, be sure to join https://medarc.ai and contribute!
Thanks to FAL AI for providing compute that supported this research. Thanks to Prime Intellect for providing API inference credits. Thanks to the MedARC Discord community for being the public forum from which this research was developed.
For attribution in academic contexts, please cite the arXiv paper:
Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Scholz, Bofeng Huang, Molly Beavers, Srishti Gureja, Anish Mahishi, Sameed Khan, Maxime Griot, Hunar Batra, Jean-Benoit Delbrouck, Siddhant Bharadwaj, Ronald Clark, Ashish Vashist, Anas Zafar, Leema Krishna Murali, Harsh Deshpande, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul Steven Scotti, and Tanishq Mathew Abraham. "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks." arXiv preprint arXiv:2605.01417, 2026.
BibTeX citation:
@article{warner2026medmarks,
title={Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks},
author={Warner, Benjamin and Grandhi, Ratna Sagari and Kieffer, Max and Ouraq, Aymane and Panigrahi, Saurav and Ambwani, Geetu and Bagga, Kunal and Khandekar, Nikhil and Hariharan, Arya and Mishra, Nishant and Ram, Manish and Sim Zi Yang, Shamus and Essouaied, Ahmed and Moyondafoluwa, Adepoju Jeremiah and Scholz, Robert and Huang, Bofeng and Beavers, Molly and Gureja, Srishti and Mahishi, Anish and Khan, Sameed and Griot, Maxime and Batra, Hunar and Delbrouck, Jean-Benoit and Bharadwaj, Siddhant and Clark, Ronald and Vashist, Ashish and Zafar, Anas and Murali, Leema Krishna and Deshpande, Harsh and Patel, Ameen and Brown, William and Hagemann, Johannes and Lane, Connor and Scotti, Paul Steven and Abraham, Tanishq Mathew},
journal={arXiv preprint},
eprint={2605.01417},
archivePrefix={arXiv},
year={2026},
url={https://arxiv.org/abs/2605.01417}
}