SAFE: Multitask Failure Detection for
Vision-Language-Action Models

In Submission

1University of Toronto (UofT), 2UofT Robotics Institute, 3Vector Institute, 4Toyota Research Institute (TRI)

We introduce the multitask failure detection problem for VLA models, and propose SAFE, a failure detector that can detect failures for unseen tasks zero-shot and achieve state-of-the-art performance.

Abstract

While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures in unseen tasks and novel environments as well. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it extensively on OpenVLA, π0, and π0-FAST in both simulated and real-world environments. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction.

VLA Latent Feature Analysis

We find that a VLA's internal features capture high-level information about task success and failure, and that this information generalizes across tasks. As shown in the figure below, when a VLA is failing, its features fall into the same failure zone, even across different tasks. This motivates SAFE, an efficient multitask failure detector that operates on VLA internal features and can generalize to unseen tasks.


SAFE: Multitask Failure Detector for VLA Models

Based on the above observation, we propose SAFE, a failure detector that learns from VLA internal features and predicts a single scalar indicating the likelihood of task failure. SAFE has 3 main components:

  • Feature Extraction: SAFE extracts latent features from the last layer of the VLA model. In experiments, we ablate different ways of extracting these features and aggregating them into a single feature vector.
  • Learning Failure Detector: SAFE sequentially processes the latent features and predicts a failure score, using an MLP or LSTM backbone. These models have only 1 or 2 layers to reduce overfitting and improve generalization.
  • Calibration and Deployment: SAFE determines a time-varying threshold using functional conformal prediction (CP) on a held-out calibration set. If the predicted score exceeds the threshold during testing, SAFE raises a failure alert with calibrated confidence.
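The three components above can be sketched end-to-end as follows. This is a minimal numpy sketch, not the paper's implementation: the scorer weights are random here rather than learned, the feature dimension is arbitrary, and a simple per-timestep quantile band stands in for the full functional CP calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def failure_score(feat, W, b):
    # One-layer scorer standing in for SAFE's learned MLP/LSTM head
    # (W and b are random here; SAFE learns them from rollout features).
    return float(1.0 / (1.0 + np.exp(-(feat @ W + b))))

D = 8                                   # latent feature dimension (illustrative)
W, b = rng.normal(size=D), 0.0

# Calibration: score T timesteps of n_cal successful rollouts, then take a
# per-timestep quantile as a simplified stand-in for the functional CP band.
T, n_cal, alpha = 20, 50, 0.1
cal_scores = np.array([
    [failure_score(rng.normal(size=D), W, b) for _ in range(T)]
    for _ in range(n_cal)
])
threshold = np.quantile(cal_scores, 1 - alpha, axis=0)   # shape (T,)

# Deployment: raise an alert the first time the score exceeds the band.
test_scores = cal_scores[0]             # stand-in for a test rollout's scores
alert_t = next((t for t in range(T) if test_scores[t] > threshold[t]), None)
```

The per-timestep band captures the key idea: scores from successful calibration rollouts define what "normal" looks like at each timestep, and a test-time score above the band triggers an alert.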

Experiments

We evaluate the following diverse baselines. All the baselines use the same conformal prediction framework as SAFE to determine the time-varying threshold.

  • Token Uncertainty: Failure scores are computed based on token-wise uncertainty (probability and entropy).
  • Embedding Distribution: Failure scores are computed based on the embedding distances to the calibration distribution.
  • Sample Consistency: Multiple actions are sampled, and the failure score measures the inconsistency among the samples.
  • Action Consistency: We adopt STAC scores and also STAC-single that only uses a single sample per timestep.
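As a concrete illustration of the first baseline family, a token-entropy failure score can be sketched as follows. This is a generic sketch; the exact per-token aggregation used in the paper's baselines may differ.

```python
import numpy as np

def token_entropy_score(token_probs):
    # Failure score from token-wise predictive entropy, averaged over the
    # action tokens (probability-based variants are also possible).
    probs = np.asarray(token_probs, dtype=float)
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=-1)  # entropy per token
    return float(ent.mean())

# A peaked (confident) distribution yields a lower failure score than a flat
# (uncertain) one over the same 4-way token vocabulary.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 5
flat   = [[0.25, 0.25, 0.25, 0.25]] * 5
```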

We conduct experiments with the OpenVLA, π0, and π0-FAST VLA models on the LIBERO and SimplerEnv benchmarks and on a real-world Franka robot.

How well do failure detectors distinguish failures from successes?

Following the LLM uncertainty quantification literature, we report the area under the ROC curve (ROC-AUC) in the following figure. ROC-AUC is computed from the maximum predicted failure score in each rollout. This metric averages performance over all possible thresholds, reflecting overall failure detection performance regardless of threshold selection. In the following table, the best and second best results are highlighted in red and orange, respectively.
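For reference, ROC-AUC over per-rollout maximum scores can be computed with the rank (Mann-Whitney) formulation: it equals the probability that a randomly chosen failure rollout scores higher than a randomly chosen success. The function and data below are illustrative, not from the paper's codebase.

```python
def roc_auc(scores, labels):
    # Rank formulation of ROC-AUC: fraction of (failure, success) pairs in
    # which the failure scores higher; ties count as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]   # failed rollouts
    neg = [s for s, y in zip(scores, labels) if y == 0]   # successful rollouts
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Per-rollout score = maximum predicted failure score over the rollout.
per_step_scores = [[0.1, 0.2], [0.3, 0.9], [0.2, 0.1], [0.8, 0.7]]
rollout_scores = [max(r) for r in per_step_scores]
labels = [0, 1, 0, 1]                                     # 1 = failure
```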

How do detection accuracy and detection time trade off using functional CP?

By varying the significance level α used in functional conformal prediction (CP), we can control the conservativeness of failure detection, which gives a trade-off between detection accuracy and detection time. In the following figure, we plot the balanced accuracy ((TPR + TNR)/2) against the average detection time for different α values.
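The effect of α can be illustrated with a toy sweep. All scores and labels below are made up; the point is that a larger α lowers the CP threshold, so alerts fire earlier at the cost of more false positives, while a smaller α is more conservative.

```python
import numpy as np

def balanced_accuracy(preds, labels):
    # (TPR + TNR) / 2, where label 1 = failed rollout, pred True = alarm raised.
    tp = sum(p and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    pos = sum(labels)
    return 0.5 * (tp / pos + tn / (len(labels) - pos))

# Toy per-rollout max failure scores, ground-truth labels, and the max scores
# of successful calibration rollouts (all values invented for illustration).
scores  = np.array([0.2, 0.3, 0.7, 0.9, 0.4, 0.8])
labels  = [0, 0, 1, 1, 0, 1]
cal_max = np.array([0.25, 0.35, 0.3, 0.45])

results = {}
for alpha in (0.05, 0.25, 0.5):
    thr = np.quantile(cal_max, 1 - alpha)   # larger alpha -> lower threshold
    results[alpha] = balanced_accuracy([s > thr for s in scores], labels)
```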

Are the detected failures aligned with human intuition?

In the following, we show a few example successful and failed rollouts together with the failure scores predicted by SAFE. The green shaded region indicates the failure detection threshold determined by functional conformal prediction. Video frames with a red border indicate that a failure alert has been raised at that time step.

Success

Failure: When the robot gets stuck while picking up alphabet soup, it raises a failure signal

Success

Failure: The robot gets stuck in its initial state.

Success

Failure: The robot knocks over the tomato sauce, fails to grasp it, and subsequently exhibits dangerous behavior.

Success

Failure: When the robot attempts to place the bowl, it exhibits unexpected dangerous behavior.

Success

Failure: The robot misses the insertion attempt and subsequently exhibits unstable behavior.

Success

Failure: The robot gets stuck while attempting to grasp, triggering a failure signal.

Success

Failure: The robot repeatedly fails to grasp the carrot.

Success: Note that failure scores stop increasing after the robot finishes the task.

Failure: The robot gets stuck while picking up the handle of the lid.

Success: Note that failure scores stop increasing after the robot finishes the task.

Failure: The robot repeatedly fails to grasp the carrot despite multiple attempts.

BibTeX

@article{gu2025safe,
  author    = {Gu, Qiao and Ju, Yuanliang and Sun, Shengxiang and Gilitschenski, Igor and Nishimura, Haruki and Itkina, Masha and Shkurti, Florian},
  title     = {SAFE: Multitask Failure Detection for Vision-Language-Action Models},
  journal   = {arXiv preprint arXiv:2506.09937},
  year      = {2025},
}