Medical VLM post-training Pass@K diagnostics Boundary-aware RL

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

MedBridgeRL studies when RL genuinely helps medical vision-language models: first measure support with Pass@K, then decide whether to bridge weak regimes with targeted supervised data or sharpen already-supported behavior with RL.

Ahmadreza Jeddi^1,2,3,*, Kimia Shaban^1,2,3,*, Negin Baghbanzadeh^2,4,*, Natasha Sharan^1, Abhishek Moturu^1,2,3, Elham Dolatabadi^2,4, Babak Taati^1,2,3

^1 University of Toronto • ^2 Vector Institute • ^3 University Health Network • ^4 York University, Canada

Framework at a glance

Measure support first, then choose the right post-training move

Assess support

Use Pass@K to estimate the model's latent capability rather than judging it only by greedy decoding.

Compare to Acc@1

The gap between support and default behavior reveals whether the model lacks capability or just samples poorly.

Bridge weak regimes

When support is low, targeted medical SFT expands what the model can reach.

Sharpen with RL

When support is already present, RL improves sampling efficiency and pushes good answers toward the top.

Overview

What MedBridgeRL is asking

Does RL improve reasoning?

Medical RL post-training often helps on some benchmarks but can fail to transfer to others, or even reduce latent capability on them.

Support vs. behavior

Greedy accuracy alone hides whether a model knows the answer somewhere in its distribution or never reaches it at all.

A practical recipe

The paper turns this diagnosis into a simple decision rule: bridge with SFT when support is weak, prioritize RL when support is sufficient.

Reinforcement learning is increasingly used to post-train medical vision-language models, yet it is still unclear when it truly helps. A gain in greedy accuracy might reflect better reasoning, better sampling, or simply better alignment to a narrow benchmark.

MedBridgeRL separates these effects through the lens of Pass@K versus Acc@1. Support tells us whether correct answers are already present in the model's distribution, while default behavior tells us whether the model can surface them reliably.

This view leads to a boundary-aware strategy: use task- or modality-proximal supervision to expand support when it is low; once support is already non-trivial, use RL to sharpen the distribution and improve top-1 performance and sampling efficiency.

Framework

From measurement to decision

Step 1: quantify support

MedBridgeRL starts by asking a sharper question than “what is the greedy answer?” It samples multiple reasoning traces, measures whether the correct answer appears anywhere among them, and treats that as Pass@K support.

High Pass@K, low Acc@1: the model often knows the answer but does not reliably choose it first.
Low Pass@K: the answer is rarely in the distribution at all, so RL has little to sharpen.
Practical implication: measure before optimizing, or post-training may target the wrong bottleneck.
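
The measurement step above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes each question already has a list of booleans marking which sampled traces were correct, and it uses the standard unbiased Pass@K estimator 1 - C(n-c, k) / C(n, k) rather than whatever estimator the authors use. The names pass_at_k and support_report are hypothetical.

```python
from math import comb
from typing import Dict, Sequence


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased Pass@K estimate from n sampled traces, c of which are correct."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    # Probability that a random size-k subset of the samples has no correct trace.
    # math.comb returns 0 when fewer than k incorrect traces exist.
    miss = comb(num_samples - num_correct, k) / comb(num_samples, k)
    return 1.0 - miss


def support_report(
    sampled_correct: Sequence[Sequence[bool]],
    greedy_correct: Sequence[bool],
    k: int,
) -> Dict[str, float]:
    """Average Pass@K and Acc@1 over questions, plus the gap between them."""
    if not greedy_correct or len(sampled_correct) != len(greedy_correct):
        raise ValueError("need one non-empty entry per question in both inputs")
    pass_k = sum(
        pass_at_k(len(flags), sum(flags), k) for flags in sampled_correct
    ) / len(sampled_correct)
    acc_1 = sum(greedy_correct) / len(greedy_correct)
    # A large positive gap suggests the answer is reachable but not surfaced first.
    return {"pass_at_k": pass_k, "acc_at_1": acc_1, "gap": pass_k - acc_1}


if __name__ == "__main__":
    # Hypothetical toy data: two questions, four sampled traces each.
    samples = [[True, False, False, False], [False, False, False, False]]
    greedy = [False, False]
    print(support_report(samples, greedy, k=4))
```

Reporting the gap alongside the two metrics keeps the diagnosis explicit: a high Pass@K with a low Acc@1 points to a sampling problem, while a low Pass@K points to missing capability.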

Figure: MedBridgeRL measurement panel showing Pass@K and Acc@1. Support is estimated with sampled traces and compared against greedy behavior, revealing whether the limitation is capability or sampling.

Boundary-aware recipe

Bridge first, then sharpen

Figure: MedBridgeRL thresholding support and choosing between SFT and RL. The thresholded support view turns RL post-training into a regime-selection problem instead of a one-size-fits-all recipe.

Step 2: choose the right intervention

Once support is measured, the next move becomes much clearer. If the model has weak support, targeted medical SFT can move probability mass toward useful regions. If support is already high, RL is better used to concentrate that mass and improve Acc@1.

Bridge support with SFT: expand what the model can reach.
Sharpen support with RL: make strong candidates easier to surface.
End result: a more interpretable and more transferable post-training recipe for medical VLMs.
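
As a rough sketch, the decision rule above reduces to a single threshold on measured support. The function name choose_intervention, the returned labels, and the default threshold of 0.5 are all hypothetical placeholders; this page does not state the threshold the paper uses, so the value should be treated as something to tune per task or modality.

```python
def choose_intervention(pass_at_k: float, support_threshold: float = 0.5) -> str:
    """Pick a post-training move from measured Pass@K support.

    support_threshold is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 <= pass_at_k <= 1.0:
        raise ValueError("pass_at_k must lie in [0, 1]")
    if not 0.0 <= support_threshold <= 1.0:
        raise ValueError("support_threshold must lie in [0, 1]")
    if pass_at_k < support_threshold:
        # Weak support: correct answers are rarely sampled, so expand reach first.
        return "bridge_with_sft"
    # Sufficient support: correct answers are reachable, so sharpen with RL.
    return "sharpen_with_rl"


if __name__ == "__main__":
    # Hypothetical per-modality support measurements.
    for modality, support in {"modality_a": 0.82, "modality_b": 0.21}.items():
        print(modality, choose_intervention(support))
```

Applying the rule per task or modality, rather than once per model, matches the regime-selection framing: the same model can sit in the bridge regime for one modality and the sharpen regime for another.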

When support is low

RL alone is unlikely to rescue the model. First use supervision that is close to the task or modality to raise Pass@K.

When support is already present

RL becomes effective because it sharpens the distribution, improving top-1 accuracy and sampling efficiency without needing to invent new capability from scratch.
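
A toy calculation may help show why the two regimes differ. The sketch below is not the paper's RL algorithm: it assumes an idealized update that reweights each answer's probability by exp(beta * reward), a common simplification of KL-regularized RL, over a hypothetical three-answer distribution. Answers with nonzero probability can be pushed to the top, while an answer with zero probability stays at zero no matter how large the reward is.

```python
from math import exp
from typing import Dict


def reward_tilt(probs: Dict[str, float], correct: str, beta: float) -> Dict[str, float]:
    """Reweight p(a) by exp(beta * r(a)) with r = 1 for the correct answer, then renormalize."""
    weights = {
        answer: p * exp(beta if answer == correct else 0.0)
        for answer, p in probs.items()
    }
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("probabilities must have positive total mass")
    return {answer: w / total for answer, w in weights.items()}


if __name__ == "__main__":
    # Supported regime: the correct answer "A" is sampled sometimes but is not the mode.
    supported = {"A": 0.2, "B": 0.5, "C": 0.3}
    # Unsupported regime: the correct answer "A" is never sampled.
    unsupported = {"A": 0.0, "B": 0.6, "C": 0.4}

    for name, dist in (("supported", supported), ("unsupported", unsupported)):
        tilted = reward_tilt(dist, correct="A", beta=3.0)
        top_answer = max(tilted, key=tilted.get)
        print(name, "p(A) after update:", round(tilted["A"], 3), "top answer:", top_answer)
```

In the supported case the correct answer becomes the most likely one, which is the sharpening effect described above. In the unsupported case the update has nothing to amplify, which is why supervision that raises Pass@K has to come first.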

Citation

BibTeX

@misc{jeddi2026doesrlhelpmedical,
  title={When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains},
  author={Ahmadreza Jeddi and Kimia Shaban and Negin Baghbanzadeh and Natasha Sharan and Abhishek Moturu and Elham Dolatabadi and Babak Taati},
  year={2026},
  eprint={2603.01301},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.01301}
}