CapaBench

A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents

Yingxuan Yang1, Bo Huang1, Siyuan Qi1, Chao Feng1, Haoyi Hu1, Yuxuan Zhu2, Jinbo Hu1, Haoran Zhao1, Ziyi He3, Xiao Liu4, Zongyu Wang4, Lin Qiu4, Xuezhi Cao4, Xunliang Cai4, Yong Yu1, Weinan Zhang1
1Shanghai Jiao Tong University, 2University of Chicago, 3University of Toronto, 4Meituan

Contact: zoeyyx@sjtu.edu.cn, wnzhang@sjtu.edu.cn

Introduction

CapaBench is a novel evaluation framework that leverages the Shapley Value from cooperative game theory to measure the contributions of individual modules within modular LLM agents. By quantifying these contributions, CapaBench facilitates systematic optimization and enhances the interpretability of agent architectures.

The framework evaluates Planning, Reasoning, Action, and Reflection capabilities by analyzing their individual and synergistic impacts on task performance. With a comprehensive dataset of over 1,500 multi-round, practical task scenarios, CapaBench enables robust and generalizable assessments tailored to real-world applications.

[Figure: radar chart]

Framework Overview

Agent Workflow

[Figure: agent workflow]

Key Components

Planning

Decomposes complex tasks into structured subtasks, enabling efficient resource allocation.

Reasoning

Uses logical inference and contextual understanding to determine appropriate actions.

Action

Translates cognitive processes into executable operations, ensuring effective task execution.

Reflection

Analyzes outcomes to iteratively improve performance through feedback and adjustments.
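To illustrate how these four modules interact, the sketch below wires them into a single agent loop. The interfaces (plan, decide, act, revise) are assumptions for illustration only, not the benchmark's actual API; in CapaBench, each role can be filled by a different LLM, which is what makes module-level attribution meaningful.

from dataclasses import dataclass, field

@dataclass
class ModularAgent:
    """Illustrative four-module agent; interfaces are assumptions, not CapaBench's API."""
    planner: object    # decomposes the task into ordered subtasks
    reasoner: object   # infers the next decision from the plan and observations
    actor: object      # turns decisions into executable environment operations
    reflector: object  # analyzes failures and revises the plan
    history: list = field(default_factory=list)

    def run_episode(self, task, env, max_rounds: int = 10) -> bool:
        plan = self.planner.plan(task)
        for _ in range(max_rounds):
            decision = self.reasoner.decide(task, plan, self.history)
            # env.step is assumed to return (observation, done) for this sketch
            observation, done = env.step(self.actor.act(decision))
            self.history.append((decision, observation))
            if done:
                return True
            plan = self.reflector.revise(plan, self.history)  # feedback loop
        return False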

Shapley Value Illustration

[Figure: Shapley Value illustration]

The Shapley Value, a cornerstone of cooperative game theory, provides a mathematically rigorous method for quantifying the marginal contribution of each module in an agent's architecture. Credit is attributed fairly by averaging a module's marginal contribution over all possible orders in which modules can be added.
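In standard notation, with N the set of modules and v(S) the measured performance of a coalition S ⊆ N (the precise value function is defined in the paper), the Shapley Value of module i is:

\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right)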

For example, in a task requiring Planning, Reasoning, and Action, the Shapley Value captures the unique contribution of each module and their interactions. This enables a deeper understanding of how modules work together to drive performance.
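With only four modules there are 2^4 = 16 coalitions, so the exact Shapley Value can be computed by direct enumeration. A minimal sketch, assuming the performance v(S) of every coalition S has already been measured (module names and the lookup-table format are illustrative):

from itertools import combinations
from math import factorial

MODULES = ("planning", "reasoning", "action", "reflection")

def shapley_values(value: dict) -> dict:
    """Exact Shapley Values from a lookup table `value` that maps every
    coalition (a frozenset of module names, including frozenset()) to its
    measured task performance v(S)."""
    n = len(MODULES)
    phi = {}
    for module in MODULES:
        others = [m for m in MODULES if m != module]
        contribution = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Weight = |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                contribution += weight * (value[s | {module}] - value[s])
        phi[module] = contribution
    return phi

Summing the four values recovers v(N) − v(∅), so the attribution is exhaustive as well as fair.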

Dataset Construction

To ensure that our evaluation reflects realistic, multi-faceted application scenarios, we build a large-scale dataset of over 1,500 multi-round tasks spanning a diverse range of categories (e.g., shopping, operating systems, robot control, math, and theorem proving).

These tasks integrate various capabilities such as planning, tool usage, and reflection, thereby requiring holistic agent performance rather than isolated skill assessments. Our dataset will be open-sourced in the future to support further research and development, and we are actively adding more scenarios to broaden its coverage and applicability.

Task categories:

  • Daily Activities: Shopping, Navigation, Ticket
  • Computation: Math, ATP (automated theorem proving)
  • Role Control: OS, Robot

Capability sub-dimensions:

  • Planning: Task Steps, Resource Constraints
  • Reasoning: Logical Validation, Knowledge Inference
  • Action: Environmental Actions, Interactive Actions
  • Reflection: Failure Analysis
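For concreteness, a single multi-round task can be pictured roughly as the record below. This schema and its field names are purely illustrative assumptions and do not represent the released dataset format.

# Purely illustrative; field names and values are hypothetical, not the released format.
example_task = {
    "category": "Daily Activities / Shopping",
    "instruction": "Buy a laptop stand under $30 and arrange weekend delivery.",
    "constraints": {"budget_usd": 30, "delivery_window": "weekend"},
    "max_rounds": 5,
    "capabilities": ["planning", "reasoning", "action", "reflection"],
    "success_criterion": "order placed within budget and delivery window",
}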

Experiment Results

Selected Results Across Datasets

The figure highlights results for selected models and datasets; metrics for baseline models are highlighted in blue. The evaluation covers nine models across five primary tasks, showing notable performance variations and distinct module contributions. Results marked with ‘*’ below each dataset indicate the best-performing model combinations as determined by their Shapley Values.

[Figure: experimental results across datasets]

Key Findings

Cross-Task Model Performance Comparison

A high-level comparison of model performance across diverse tasks reveals distinct strengths and weaknesses. Notably, Claude-3.5 outperforms other models in most categories, showing particular strength in formal verification (e.g., Coq, Lean 4, Isabelle) and robot cooperation tasks. This advantage suggests that Claude-3.5 has a robust underlying chain-of-thought reasoning mechanism and effective multi-agent collaboration strategies, capabilities essential for tasks that demand precise logical proof structures and synchronized actions.

On the other hand, open-source models such as Qwen-2.5 and Mistral-8X7B perform moderately well in more straightforward domains, such as shopping or basic algebra, but underperform in cognitively demanding tasks. Their lag in automated theorem proving and robot cooperation implies that while these models handle routine queries and procedural problem-solving competently, they lack the deeper reasoning, advanced planning, or specialized modules needed for high-stakes coordination and rigorous proof validation. Strengthening these areas, possibly through fine-tuning on specialized corpora or integrating more advanced tool usage, could help close the gap between open-source and proprietary models in complex, multi-stage tasks.

Module Contribution Patterns

Our findings highlight that module contributions vary according to task demands, reflecting the distinct cognitive processes involved. Specifically:

  • Tasks with High Cognitive Complexity (e.g., Online Shopping, Robot Cooperation, and OS): Reasoning and Planning play pivotal roles. Online shopping requires balancing constraints (e.g., budget and preferences) and sequencing decisions effectively. In robot cooperation, Reasoning enables dynamic information updates and efficient task distribution among agents. Operating system tasks, which involve troubleshooting and resource management, rely heavily on real-time problem-solving and feedback interpretation. Across these tasks, robust Reasoning ensures logical inference and decision-making under uncertainty.
  • Tasks Requiring Precision (e.g., Math Solvers and ATP): Action is the dominant module. In math solvers, particularly geometry, precise procedural execution, such as applying theorems or constructing diagrams, outweighs strategic planning. Similarly, in formal verification tasks (e.g., Coq or Lean), strict adherence to syntactic and semantic correctness is critical. Both scenarios demand meticulous step-by-step actions to ensure reliability and prevent errors.

Low Reflection Contribution

We attribute the seemingly low contribution of the Reflection module to overall task performance to two main considerations. First, whether a reflection translates into a higher success rate does not necessarily reflect its true quality or efficacy; task success alone may not be the best measure of how well the model is “thinking about” its own mistakes. Second, when the model reflects on its own errors without extra information or guidance from a more capable model, it may fail to pinpoint the actual causes of its mistakes. Without deeper insight into error sources, reflection often does not produce meaningful improvements in task outcomes. Consequently, while the Reflection module is present, its practical impact on success rates remains limited.

Citation

@misc{yang2025whosmvpgametheoreticevaluation,
      title={Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents}, 
      author={Yingxuan Yang and Bo Huang and Siyuan Qi and Chao Feng and Haoyi Hu and Yuxuan Zhu and Jinbo Hu and Haoran Zhao and Ziyi He and Xiao Liu and Zongyu Wang and Lin Qiu and Xuezhi Cao and Xunliang Cai and Yong Yu and Weinan Zhang},
      year={2025},
      eprint={2502.00510},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.00510}, 
}