CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Tianqi Xu1,2*, Linyao Chen3*, Dai-Jie Wu1*, Yanjun Chen4*, Zecheng Zhang, Xiang Yao2, Zhiqiang Xie5, Yongchao Chen6, Shilong Liu7, Bochen Qian8, Anjie Yang2, Zhaoxuan Jin9, Jianbo Deng, Philip Torr10, Bernard Ghanem1†, Guohao Li2,10†
1KAUST, 2Eigent.AI, 3UTokyo, 4CMU, 5Stanford, 6Harvard, 7Tsinghua, 8SUSTech, 9Northwestern, 10Oxford
*Equal Contribution †Corresponding Author

CRAB aims to be a general-purpose benchmark framework for Multimodal Language Model (MLM) agents. It provides an end-to-end, easy-to-use framework for building agents, operating environments, and creating benchmarks to evaluate them, featuring three key components: cross-environment support, a graph evaluator, and automatic task generation. We present CRAB Benchmark-v0, built with the CRAB framework, which includes 120 tasks across 2 environments (Ubuntu and Android), tested with 6 different MLMs under 3 distinct communication settings.

Leaderboard

Organization | Model | Completion Ratio (%) ↑ | Success Rate (%) | Parameter Size | Visual Prompt | Structured Output Type | Release Date
OpenAI | gpt-4o-2024-05-13 | 38.01 | 14.17 | Unknown | SoM (GroundingDino+EasyOCR) | Tool calling API | 2024-10
OpenAI | gpt-4-turbo-2024-04-09 | 33.35 | 9.17 | Unknown | SoM (GroundingDino+EasyOCR) | Tool calling API | 2024-10
OpenAI | gpt-4o-2024-05-13 | 23.05 | 9.17 | Unknown | SoM (GroundingDino+EasyOCR) | Model-generated JSON | 2024-10
Anthropic | claude-3-opus-20240229 | 19.60 | 3.33 | Unknown | SoM (GroundingDino+EasyOCR) | Tool calling API | 2024-10
Google | Gemini 1.5 Pro | 15.48 | 3.33 | Unknown | SoM (GroundingDino+EasyOCR) | Tool calling API | 2024-10
Mistral AI | Pixtral-12B-2409 | 9.50 | 0.83 | 12B | SoM (GroundingDino+EasyOCR) | Model-generated JSON | 2024-10
ByteDance | llava-onevision-qwen2-72b-ov-chat | 6.64 | 0.83 | 72B | SoM (GroundingDino+EasyOCR) | Model-generated JSON | 2024-10

The results are based on the CRAB Benchmark v0, released on 2024-10-18, with a total of 120 tasks.

Features

  • Cross-environments: CRAB supports multiple environments, ensuring that agents can seamlessly adapt and excel across different interfaces.
  • Graph evaluator: With fine-grained evaluation, CRAB goes beyond binary success rates to provide a detailed analysis of agent performance, highlighting strengths and pinpointing areas for improvement.
  • Task Generation: CRAB automates task creation using a graph-based method. By combining multiple sub-tasks into complex tasks, CRAB generates dynamic tasks that closely mimic real-world scenarios, saving time and reducing the effort required for manual task creation.
  • Easy-to-use: All agent operations (actions), observations, and benchmark evaluators are defined as Python functions, so adding a new environment to CRAB requires only a few lines of Python code. The benchmark configuration follows a declarative programming paradigm, making it easy to reproduce any experiment environment (a minimal sketch follows this list).
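
To make the "easy-to-use" point concrete, here is a minimal, hypothetical sketch of the idea; the function and class names below are illustrative, not the actual CRAB API. Actions, observations, and evaluators are plain Python functions, and an environment is declared as data that references them.

# Hypothetical sketch: all names below are illustrative, not the real CRAB API.
import os
from dataclasses import dataclass, field
from typing import Callable

def take_screenshot() -> str:
    """Observation: return the path of the current screenshot (stubbed)."""
    return "/tmp/screen.png"

def click(x: int, y: int) -> None:
    """Action: click at pixel coordinates (stubbed)."""
    print(f"click({x}, {y})")

def file_exists(path: str) -> bool:
    """Evaluator: check the environment state after the agent has acted."""
    return os.path.exists(path)

@dataclass
class EnvironmentConfig:
    """Declarative environment description: plain data referencing functions."""
    name: str
    actions: list[Callable] = field(default_factory=list)
    observations: list[Callable] = field(default_factory=list)

# Adding a new environment is just a few lines of configuration.
ubuntu_env = EnvironmentConfig(
    name="ubuntu",
    actions=[click],
    observations=[take_screenshot],
)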

Demo Videos

Open "Slack" in Ubuntu, navigate to "multi-modal-benchmark" channel, summarize the last two messages, then use "Messages" app in android phone to send the to the first contact in the list.

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Tasks" app on Android, check the first incomplete task, then perform the task according to its description

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Calendar" app on Android, summarize all schedules today. Then, in Ubuntu, create a markdown file at "/home/crab/assets/plan.md" with each event as a checkbox bullet point using Terminal and Vim.

Settings: OpenAI GPT-4o + Single Agent


Please open the X app on my phone, search for CAMEL-AI.org, check the latest post, summarize it, and then send the summary to Tianqi Xu on Slack from my PC.

Settings: OpenAI GPT-4o + Single Agent

Demo videos are edited for a better viewing experience. In actual execution, there are tens of seconds of waiting time between steps.

Related Works

We compare CRAB with existing GUI agents and benchmarks.

The columns detail key features of each framework:

  • Interactive Environment indicates whether the framework provides an interactive environment or only a static dataset.
  • Multimodal Observation specifies the availability of vision-based observations (e.g. screenshots).
  • Cross-platform denotes support for multiple operating systems or platforms.
  • Evaluation describes the evaluation metrics, categorized as:
    • Goal-based: checking the environment state based solely on the final goal.
    • Trajectory-based: comparing the agent's action trajectory with a gold action sequence.
    • Multiple: varies across tasks.
    • Graph-based: a DAG in which each node is an intermediate checkpoint (illustrated in the sketch after this list).
  • Task Construction shows the task construction method, including:
    • Handmade: handcrafted by humans.
    • LLM-inspired: using an LLM to generate task drafts, which are then verified and annotated by humans.
    • Template: generated by filling in the blanks in task templates.
    • Sub-task Composition: composing multiple sub-tasks to construct tasks and evaluators.
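
A rough illustration of graph-based evaluation and sub-task composition follows; the data structures are hypothetical, not CRAB's actual implementation. Each checkpoint node carries a check function, a node counts as completed once its predecessors are done and its own check passes, and the completion ratio is the fraction of completed nodes.

# Hypothetical sketch of graph-based evaluation; names are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    name: str
    check: Callable[[], bool]                      # inspects environment state
    predecessors: list[str] = field(default_factory=list)

def evaluate(checkpoints: list[Checkpoint]) -> tuple[float, bool]:
    """Return (completion ratio, success) for a topologically sorted task graph."""
    done: set[str] = set()
    for node in checkpoints:
        if all(p in done for p in node.predecessors) and node.check():
            done.add(node.name)
    completion_ratio = len(done) / len(checkpoints)
    success = len(done) == len(checkpoints)
    return completion_ratio, success

# Sub-task composition: two sub-task evaluators chained into one task graph.
task_graph = [
    Checkpoint("file_created", check=lambda: True),
    Checkpoint("content_correct", check=lambda: False, predecessors=["file_created"]),
]
print(evaluate(task_graph))                        # prints (0.5, False)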

Experiment Results

  • Diverse Performance Across Models
    • GPT-4o has the highest success rate and completion ratio overall.
    • Gemini 1.5 Pro and Claude 3 Opus struggle more and can hardly complete the tasks.
    • With model-generated JSON output, GPT-4o's performance drops compared to the function-calling version, primarily due to a higher Invalid Action rate.
    • Among open-source models, Pixtral-12B, with far fewer parameters, achieves a higher completion ratio than LLaVA-ov-72B, showcasing its efficiency. Although the open-source models generally understand screenshots and generate step-by-step plans correctly, they often fail to execute the correct actions according to those plans.
  • Completion Ratios Reflect True Agent Capability
    • Completion Ratio (CR), computed by the graph evaluator as the fraction of completed checkpoint nodes, provides a more detailed view of performance than the binary success rate (a simplified aggregation sketch follows this list).
    • Even though GPT-4o with the single-agent structure and with the multi-agent-by-environment structure achieve the same success rate, their completion ratios differ by up to 4.67%.
  • Termination Reasons Indicate Areas for Improvement
    • High Reach Step Limit (RSL) percentages across all models suggest that agents often run out of steps before achieving the tasks' final goals.
    • Invalid Action (IA) rates are high for Gemini 1.5 Pro, highlighting issues with generating actions in the correct format.
    • False Completion (FC) rates in multi-agent settings are higher than those in single-agent settings, indicating that message loss occurs during communication among multiple agents.
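
For reference, here is a simplified sketch of how these metrics could be aggregated across task attempts; the record fields and termination labels below are hypothetical.

# Hypothetical per-attempt records; field names and labels are illustrative.
from collections import Counter

episodes = [
    {"completion_ratio": 1.0, "termination": "success"},
    {"completion_ratio": 0.4, "termination": "reach_step_limit"},
    {"completion_ratio": 0.2, "termination": "invalid_action"},
    {"completion_ratio": 0.6, "termination": "false_completion"},
]

n = len(episodes)
success_rate = sum(e["termination"] == "success" for e in episodes) / n
completion_ratio = sum(e["completion_ratio"] for e in episodes) / n
termination_rates = {k: v / n for k, v in Counter(e["termination"] for e in episodes).items()}

print(f"SR={success_rate:.2%}  CR={completion_ratio:.2%}  {termination_rates}")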

BibTeX

@misc{xu2024crab,
      title={CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents}, 
      author={Tianqi Xu and Linyao Chen and Dai-Jie Wu and Yanjun Chen and Zecheng Zhang and Xiang Yao and Zhiqiang Xie and Yongchao Chen and Shilong Liu and Anjie Yang and Zhaoxuan Jin and Jianbo Deng and Bochen Qian and Philip Torr and Bernard Ghanem and Guohao Li},
      year={2024},
      eprint={2407.01511},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.01511}, 
}