FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition

TL;DR: FusionAgent leverages a Multimodal LLM agent to dynamically select the optimal model combination per sample, achieving robust and explainable human recognition via adaptive score fusion.

Abstract

Existing whole-body human recognition systems usually fuse face, gait, and body models with a fixed strategy for all samples. This static design is inefficient and can hurt accuracy when some modalities are unreliable (e.g., missing face cues in back-view videos).

We propose FusionAgent, an MLLM-based framework that dynamically selects model subsets per sample and fuses their outputs with Anchor-based Confidence Top-k (ACT) score fusion. Trained with reinforcement fine-tuning under Group Relative Policy Optimization (GRPO) and a metric-based reward, FusionAgent learns explainable, sample-specific model selection without requiring ground-truth reasoning traces or exhaustive search over model combinations.

Across CCVID, LTCC, and MEVID, FusionAgent consistently improves open-set search and verification performance, while also providing a practical trade-off between interpretability and speed through Chain-of-Thought (CoT) and Direct Answering (DA) inference modes.

Why Dynamic Model Selection?

Existing score-fusion methods — both rule-based and learning-based — assume that all models provide complementary information, treating the model combination as fixed for every input. This can be suboptimal: for instance, when a video only captures the back view of a person, incorporating face recognition models would be inappropriate and may even degrade performance.

FusionAgent addresses this by leveraging an MLLM agent to dynamically select a subset of models per sample, followed by adaptive score fusion, enabling robust integration tailored to each input.

Method

The FusionAgent framework consists of two core components: (1) an MLLM-based agent that performs multi-turn reasoning and dynamic model selection using a ReAct-style controller, and (2) the ACT score-fusion algorithm that robustly integrates outputs from the selected models. The agent is trained via Group Relative Policy Optimization (GRPO) with four reward functions: format reward, tool success reward, answer accuracy reward, and a novel metric-based reward.

For each query, the agent first analyzes the biometric evidence and selects an initial model. Each tool returns both a predicted identity and an internal score vector over the gallery; the score vector is retained for fusion, while the predicted identity is exposed to the agent for subsequent reasoning. The agent can then continue invoking tools or terminate with a final answer and reasoning summary, yielding transparent, traceable decisions.

**Overview of the FusionAgent framework.** Recognition models are wrapped as tools to generate score vectors and predicted identities. The MLLM agent receives multimodal biometric inputs, performs reasoning-action steps through multi-turn dialogue, selectively invokes tools, and integrates predictions into a final identity decision and fused score vector. The agent is optimized with reinforcement fine-tuning using rule-based rewards, including the proposed metric-based reward.

Anchor-based Confidence Top-k (ACT) Score-Fusion

ACT dynamically combines scores from multiple models by leveraging a stable anchor model (the first model selected by the agent) to provide a robust score vector, while selectively incorporating normalized, high-confidence scores from complementary models. The anchor model provides a global ranking structure, while selected models offer sparse, localized refinements only for their top-k predictions.

This design reduces the misalignment introduced by heterogeneous embeddings and sample-wise model selection. In practice, ACT is especially effective for difficult verification and open-set search settings, where suppressing noisy non-match scores is as important as improving the top match.

ACT score-fusion example — **A toy example of ACT score-fusion.** Three models are used with the FR model as the anchor and k=1. ACT amplifies the gap between match and non-match scores through confidence-based top-k and anchor weighting, improving verification and open-set search performance.

Experimental Results

FusionAgent is evaluated on three challenging whole-body biometric benchmarks: CCVID, LTCC, and MEVID. The results consistently demonstrate superior performance across all metrics, especially in verification (TAR@FAR) and open-set search (FNIR@FPIR).

Compared with statistical fusion baselines and the quality-aware fusion method QME, FusionAgent achieves the most consistent gains in FNIR, showing that adaptive model selection is particularly beneficial for robust open-set retrieval. The paper also reports that DA substantially reduces latency relative to CoT, while CoT provides more interpretable reasoning, and that the agent transfers well to cross-domain LTCC evaluation with zero-shot and 10-shot adaptation.

Performance on **CCVID**
Method	Rank-1 ↑	mAP ↑	TAR@1%FAR ↑	FNIR@1%FPIR ↓
AdaFace	94.0	87.9	75.7	13.0 ± 3.5
CAL	81.4	74.7	66.3	52.8 ± 13.3
BigGait	76.7	61.0	49.7	71.1 ± 6.1
SapiensID	92.6	77.8	-	-
Min-Fusion	87.1	79.2	62.4	48.5 ± 8.7
Max-Fusion	89.9	89.3	73.4	23.0 ± 10.1
Z-score	92.2	90.6	73.9	15.1 ± 1.5
Min-max	91.8	90.9	73.9	15.4 ± 2.5
Weighted-sum	91.7	90.6	73.6	15.4 ± 1.8
Asym-AO1	92.3	90.0	74.0	15.9 ± 1.7
BSSF	91.8	91.1	73.9	14.1 ± 1.3
Farsight	92.0	91.2	73.9	13.9 ± 1.1
QME	94.1	90.8	76.2	12.3 ± 1.4
FusionAgent (DA)	92.8	92.2	85.8	10.5 ± 1.5
FusionAgent (CoT)	93.4	92.6	85.9	10.1 ± 1.5

Performance on **LTCC**
Method	Rank-1 ↑	mAP ↑	TAR@1%FAR ↑	FNIR@1%FPIR ↓
AdaFace	18.5	5.9	2.4	99.8 ± 0.2
CAL	74.4	40.6	36.7	59.7 ± 7.3
AIM	74.8	40.9	37.0	66.2 ± 7.5
SapiensID	72.0	34.6	-	-
Min-Fusion	38.1	13.5	12.4	81.9 ± 6.0
Max-Fusion	62.5	33.3	16.8	94.8 ± 4.7
Z-score	73.0	37.5	30.4	68.7 ± 9.2
Min-max	73.2	38.1	31.9	75.1 ± 9.2
Weighted-sum	73.2	37.8	31.3	72.4 ± 8.6
Asym-AO1	71.2	32.9	19.1	76.3 ± 8.9
BSSF	73.5	39.1	34.2	68.9 ± 8.5
Farsight	73.2	37.8	31.3	72.4 ± 8.6
QME	73.8	39.6	35.0	64.3 ± 8.0
FusionAgent (DA)	75.5	41.0	36.5	50.3 ± 9.0
FusionAgent (CoT)	75.5	41.0	37.0	50.0 ± 8.5

Performance on **MEVID**
Method	Rank-1 ↑	mAP ↑	TAR@1%FAR ↑	FNIR@1%FPIR ↓
AdaFace	25.0	8.1	5.4	98.8 ± 1.2
CAL	52.5	27.1	34.7	67.8 ± 7.3
AGRL	51.9	25.5	30.7	69.4 ± 8.9
Min-Fusion	46.8	21.2	28.0	70.4 ± 8.0
Max-Fusion	33.2	14.9	8.3	97.4 ± 1.6
Z-score	54.1	27.4	30.7	66.5 ± 7.0
Min-max	52.8	24.7	25.0	71.3 ± 6.1
Weighted-sum	54.1	27.3	30.3	66.3 ± 7.0
Asym-AO1	52.5	22.9	23.6	71.7 ± 5.8
BSSF	53.5	27.4	30.5	65.9 ± 7.2
Farsight	53.8	25.4	26.6	69.8 ± 6.4
QME	55.7	28.2	32.9	64.6 ± 8.2
FusionAgent (DA)	52.5	28.7	34.8	60.8 ± 7.3
FusionAgent (CoT)	54.7	28.7	34.9	58.6 ± 7.4

Analysis

Score Distribution Comparison

FusionAgent achieves a markedly larger margin between match scores and the FAR threshold, while effectively suppressing non-match scores near zero.

Dynamic Model Selection Statistics

The agent adapts its model selection per dataset: on CCVID (clear faces), it anchors on the FR model; on LTCC/MEVID (surveillance), it relies more on ReID models.

The ablation study in the paper shows that the performance gain does not come from using more models indiscriminately. Compared with hard selection that fuses all models, agent-based dynamic selection consistently performs better across Rank-1, mAP, and TAR, while substantially reducing FNIR. Additional experiments also show that ACT remains effective when paired with the agent-selected model subsets, outperforming alternative fusion rules such as Z-score and Farsight on LTCC.

Effect of Top-k Value in ACT

FusionAgent consistently outperforms baselines across all top-k values, maintaining both higher performance and greater stability compared to hard selection (using all models).

BibTeX

@inproceedings{zhu2026fusionagent,
  title     = {FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition},
  author    = {Zhu, Jie and Guo, Xiao and Su, Yiyang and Jain, Anil and Liu, Xiaoming},
  booktitle = {CVPR},
  year      = {2026},
}

FusionAgent

A Multimodal Agent with Dynamic Model Selection for Human Recognition