CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

1KAUST, 2Eigent.AI, 3UTokyo, 4CMU, 5Stanford, 6Harvard, 7Tsinghua, 8SUSTech, 9Oxford
*Equal Contribution · Corresponding Author

CRAB aims to be a general-purpose benchmark framework for Multimodal Language Model (MLM) agents. CRAB provides an end-to-end yet easy-to-use framework to build agents, operate environments, and create benchmarks to evaluate them, featuring three key components: cross-environment support, a graph evaluator, and task generation. We present CRAB Benchmark-v0, developed with the CRAB framework, which includes 100 tasks across 2 environments (Ubuntu and Android), tested with 4 different MLMs under 3 distinct communication settings.

Features

  • Cross-environments: CRAB supports multiple environments, ensuring that agents can seamlessly adapt and excel across different interfaces.
  • Graph evaluator: With fine-grained evaluation, CRAB goes beyond binary success rates to provide a detailed analysis of agent performance, highlighting strengths and pinpointing areas for improvement.
  • Task Generation: CRAB automates task creation using a graph-based method. By combining multiple sub-tasks into complex tasks, CRAB generates dynamic tasks that closely mimic real-world scenarios, saving time and reducing the effort required for manual task creation.
  • Easy-to-use: All agent operations (actions), observations, and benchmark evaluators are defined as Python functions, so adding a new environment to CRAB requires only a few lines of Python code. The benchmark configuration follows a declarative programming paradigm, making it easy to reproduce any experiment environment (a sketch follows this list).
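
As a minimal sketch of that function-centric style — the `action` decorator, `Task` dataclass, and evaluator below are illustrative assumptions made for this example, not the exact CRAB API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical decorator: in this sketch an "action" is just a plain Python
# function tagged so the framework can expose it to the agent.
def action(func: Callable) -> Callable:
    func.is_action = True
    return func

@action
def write_file(path: str, content: str) -> None:
    """Write text to a file in the Ubuntu environment (runs on the target machine)."""
    with open(path, "w") as f:
        f.write(content)

# Hypothetical evaluator: also a plain Python function returning True/False.
def plan_file_exists() -> bool:
    import os
    return os.path.isfile("/home/crab/assets/plan.md")

# Declarative-style task definition: actions, observations, and evaluators are
# all referenced by function, so adding an environment is just adding functions.
@dataclass
class Task:
    description: str
    actions: List[Callable] = field(default_factory=list)
    evaluators: List[Callable] = field(default_factory=list)

task = Task(
    description="Create a markdown plan file on the Ubuntu machine.",
    actions=[write_file],
    evaluators=[plan_file_exists],
)
```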

Demo Videos

Open "Slack" in Ubuntu, navigate to "multi-modal-benchmark" channel, summarize the last two messages, then use "Messages" app in android phone to send the to the first contact in the list.

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Tasks" app on Android, check the first incomplete task, then perform the task according to its description

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Calendar" app on Android, summarize all schedules today. Then, in Ubuntu, create a markdown file at "/home/crab/assets/plan.md" with each event as a checkbox bullet point using Terminal and Vim.

Settings: OpenAI GPT-4o + Single Agent


Please open the X app on my phone, search for CAMEL-AI.org, check the latest post, summarize it, and then send the summary to Tianqi Xu on Slack from my PC.

Settings: OpenAI GPT-4o + Single Agent

Demo videos are edited for a better viewing experience. In actual execution, there are tens of seconds of waiting time between each step.

Related Works

We compare CRAB with existing GUI agents and benchmarks.

The columns of the comparison table detail key features of each framework:

  • Interactive Environment indicates whether the benchmark provides interactive environments or only static datasets.
  • Multimodal Observation specifies the availability of vision-based observations (e.g., screenshots).
  • Cross-platform denotes support for multiple operating systems or platforms.
  • Evaluation describes the evaluation metrics, categorized as:
    • Goal-based: checking the environment state solely against the final goal.
    • Trajectory-based: comparing the agent's action trajectory with a gold action sequence.
    • Multiple: varied across tasks.
    • Graph-based: a DAG with each node as an intermediate checkpoint.
  • Task Construction shows the task construction method, including:
    • Handmade: handcrafted by humans.
    • LLM-inspired: using an LLM to generate task drafts, which are then verified and annotated by humans.
    • Template: generated by filling in the blanks in task templates.
    • Sub-task Composition: composing multiple sub-tasks to construct tasks and their evaluators (a sketch follows this list).
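
Below is a minimal, self-contained sketch of how a graph-based evaluator built from composed sub-tasks might look. The `Checkpoint` class and `evaluate` function are assumptions for illustration, not the actual CRAB implementation; the point is only that each DAG node is an intermediate checkpoint implemented as a Python function, and composing sub-tasks means wiring their checkpoint graphs together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Checkpoint:
    name: str
    check: Callable[[dict], bool]            # inspects a (mocked) environment state
    predecessors: List[str] = field(default_factory=list)

def evaluate(checkpoints: Dict[str, Checkpoint], state: dict) -> Dict[str, bool]:
    """Walk the DAG in dependency order; a node can pass only after its predecessors."""
    passed: Dict[str, bool] = {}
    for name, cp in checkpoints.items():      # assumes insertion order is topological
        deps_ok = all(passed.get(p, False) for p in cp.predecessors)
        passed[name] = deps_ok and cp.check(state)
    return passed

# Composing two sub-tasks ("open the app", then "send a message") into one task DAG:
task = {
    "app_opened":   Checkpoint("app_opened",   lambda s: s.get("app") == "Messages"),
    "message_sent": Checkpoint("message_sent", lambda s: s.get("sent", False),
                               predecessors=["app_opened"]),
}

print(evaluate(task, {"app": "Messages", "sent": False}))
# {'app_opened': True, 'message_sent': False} -> two checkpoints, one satisfied
```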

Experiment Results

  1. Diverse Performance Across Models:
    • GPT-4o has the highest success rate and completion ratio overall.
    • GPT-4 TURBO demonstrates improved cost efficiency (CE) compared to other models.
    • Gemini 1.5 Pro and Claude 3 Opus struggle more and can hardly complete the tasks.
  2. Completion Ratios Reflect True Agent Capability:
    • Completion Ratio (CR), computed by the graph evaluator, provides a more detailed view of performance. In single-agent mode, GPT-4o's CR outperforms Gemini 1.5 Pro and Claude 3 Opus (a small sketch follows this list).
  3. Efficiency Metrics Highlight Strengths and Weaknesses:
    • GPT-4 TURBO excels in cost efficiency in single-agent mode, demonstrating cost-effective performance.
    • GPT-4o balances efficiency with performance, especially in single-agent mode.
    • Gemini 1.5 Pro shows lower efficiency and incomplete CE metrics, mainly due to its low completion ratio.
  4. Termination Reasons Indicate Areas for Improvement:
    • High Reach Step Limit (RSL) percentages across all models suggest agents often run out of steps and don't achieve the final goals of the tasks.
    • Invalid Action (IA) rates are high for Gemini 1.5 Pro, highlighting issues with generating actions in the correct format.
    • False Completion (FC) rates in multi-agent settings are higher than those in single-agent settings, indicating that message loss occurs during communication among multiple agents.
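
For concreteness, here is a tiny sketch of how CR could be read off a checkpoint DAG, assuming (our reading for illustration, not a quote of the paper's exact formula) that CR is the fraction of intermediate checkpoints satisfied when the episode ends:

```python
from typing import Dict

def completion_ratio(passed: Dict[str, bool]) -> float:
    """Fraction of checkpoint nodes satisfied at the end of an episode (0.0 if empty)."""
    return sum(passed.values()) / len(passed) if passed else 0.0

def is_success(passed: Dict[str, bool]) -> bool:
    """Count a task as a full success only when every checkpoint is satisfied."""
    return bool(passed) and all(passed.values())

# Reusing the example result from the evaluator sketch above:
passed = {"app_opened": True, "message_sent": False}
print(completion_ratio(passed), is_success(passed))  # 0.5 False
```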

BibTeX

@misc{xu2024crab,
      title={CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents}, 
      author={Tianqi Xu and Linyao Chen and Dai-Jie Wu and Yanjun Chen and Zecheng Zhang and Xiang Yao and Zhiqiang Xie and Yongchao Chen and Shilong Liu and Bochen Qian and Philip Torr and Bernard Ghanem and Guohao Li},
      year={2024},
      eprint={2407.01511},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.01511}, 
}