Overview of SAFER's calibration and test-time process. In Stage I, we derive a statistically valid minimum sampling budget \(\hat{s}\) that strictly controls the test-time risk of a candidate set of size \(\hat{s}\) failing to cover a correct answer. In Stage II, we use the calibration instances that obtain admissible answers within \(\hat{s}\) samples to calibrate a threshold \(\hat{t}\), which filters out unreliable answers from the candidate set while still constraining the miscoverage risk of the final prediction set.
As large language models (LLMs) are increasingly deployed in risk-sensitive applications such as real-world open-ended question answering (QA), ensuring the trustworthiness of their outputs has become critical. Existing selective conformal prediction (SCP) methods provide statistical guarantees by constructing prediction sets with a constrained miscoverage rate for correct answers. However, prior works unrealistically assume that admissible answers for all instances can be obtained via finite sampling, even in open-ended QA scenarios that lack a fixed and finite solution space. To address this, we introduce a two-stage risk control framework comprising abstention-aware Sampling and Filtering (SAFER). First, on a held-out calibration set, SAFER calibrates a sampling budget within the maximum sampling cap, using the Clopper-Pearson exact method at a user-desired risk level. If the risk level cannot be satisfied within the cap, we abstain; otherwise, the calibrated sampling budget becomes the minimum requirement at test time. Then, we employ the calibration instances where correct answers are attainable under the calibrated budget and apply the conformal risk control method to determine a statistically valid uncertainty threshold, which filters unreliable distractors from the candidate set for each test data point. We evaluate SAFER on three free-form QA datasets with five popular LLMs, and demonstrate that it rigorously constrains the two-stage miscoverage risks at test time.
Find the minimum budget \(\hat{s}\) that satisfies the user-specified risk level \(\alpha\), where \(\hat{R}^+(s)\) denotes the Clopper-Pearson upper bound on the miscoverage risk at budget \(s\) and \(M\) is the maximum sampling budget. We abstain at this risk level if \(\hat{R}^+(M) > \alpha\). $$ \hat{s} = \inf\{s \in [1, M] : \hat{R}^+(s) \le \alpha \}. $$
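A minimal Python sketch of this budget calibration, assuming each calibration instance records the 1-based rank of its first admissible answer within the cap \(M\) (or `None` if none is found within \(M\) samples); the confidence parameter `delta` of the Clopper-Pearson bound and all helper names are illustrative, not taken from the paper.

```python
# Sketch of Stage I: calibrate the minimum sampling budget s_hat.
from scipy.stats import beta


def clopper_pearson_upper(k: int, n: int, delta: float) -> float:
    """Exact (Clopper-Pearson) upper confidence bound on a binomial rate,
    given k miscovered instances out of n, at confidence level 1 - delta."""
    if k >= n:
        return 1.0
    return beta.ppf(1.0 - delta, k + 1, n - k)


def calibrate_budget(hit_rank, M: int, alpha: float, delta: float = 0.05):
    """Return the smallest budget s in [1, M] whose miscoverage upper bound
    R^+(s) is at most alpha, or None to signal abstention."""
    n = len(hit_rank)
    for s in range(1, M + 1):
        # Miscoverage at budget s: no admissible answer within the first s samples.
        k = sum(1 for r in hit_rank if r is None or r > s)
        if clopper_pearson_upper(k, n, delta) <= alpha:
            return s
    return None  # R^+(M) > alpha: abstain at this risk level
```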
Apply conformal risk control to the calibration instances to determine an uncertainty threshold \(\hat{t}\) that filters out unreliable distractors.
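The sketch below illustrates this conformal risk control step under assumed notation: each retained calibration instance contributes the maximum confidence score among its admissible answers found within \(\hat{s}\) samples, filtering keeps candidates scoring at least \(t\), and the binary miscoverage loss is therefore non-decreasing in \(t\). The score direction, the loss definition, and the threshold grid are assumptions of this sketch, not details confirmed by the paper.

```python
# Sketch of Stage II: calibrate the filtering threshold t_hat via
# conformal risk control with a binary miscoverage loss (B = 1).
import numpy as np


def calibrate_threshold(max_admissible_score, beta: float, grid_size: int = 1000):
    """Return the largest threshold t whose inflated empirical miscoverage
    (n * R_hat(t) + 1) / (n + 1) stays at or below beta."""
    scores = np.asarray(max_admissible_score, dtype=float)
    n = len(scores)
    best_t = 0.0  # t = 0 keeps every candidate, so its empirical loss is 0
    for t in np.linspace(0.0, 1.0, grid_size):
        # Instance i is miscovered if all of its admissible answers score below t.
        risk = float(np.mean(scores < t))
        if (n * risk + 1.0) / (n + 1.0) <= beta:
            best_t = t
        else:
            break  # the loss is non-decreasing in t, so larger t only gets worse
    return best_t
```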
Generate \(\hat{s}\) candidates and filter them with the threshold \(\hat{t}\) to form the final prediction set, whose risk is rigorously controlled with high probability.
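A minimal test-time sketch tying the two calibrated quantities together; `generate` and `confidence` are hypothetical helpers standing in for the LLM sampler and the uncertainty scorer, and the set-based deduplication is likewise illustrative.

```python
# Sketch of the test-time procedure: sample s_hat candidates, keep those
# whose confidence score clears the calibrated threshold t_hat.
def predict(prompt, s_hat: int, t_hat: float, generate, confidence):
    candidates = [generate(prompt) for _ in range(s_hat)]
    prediction_set = {a for a in candidates if confidence(a) >= t_hat}
    return prediction_set
```

If Stage I abstained (no valid \(\hat{s}\) within the cap), no prediction set is produced at the requested risk level.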
Empirical miscoverage rates vs. sampling budget. Dashed lines: Clopper-Pearson upper bounds; solid lines: test-time empirical rates.
Comparison of TRON and SAFER for test-time EER control in the sampling stage. SAFER achieves lower risk levels.
Test-time EER results in the filtering stage on TriviaQA (α = 0.05) and CoQA (α = 0.25) at various risk levels (β).
Comparison of calibrated sampling budget vs. prediction set size after filtering. Filtering compresses budgets into tighter sets while maintaining risk control.
@inproceedings{wang2026safer,
title={{SAFER}: Risk-Constrained Sample-then-Filter in Large Language Models},
author={Qingni Wang and Yue Fan and Xin Eric Wang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=kJmLmOvwLC}
}