CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

1KAUST, 2Eigent.AI, 3UTokyo, 4CMU, 5Stanford, 6Harvard, 7Tsinghua, 8SUSTech, 9Oxford
*Equal Contribution · Corresponding Author

CRAB aims to be a general-purpose benchmark framework for Multimodal Language Model (MLM) agents. CRAB provides an end-to-end yet easy-to-use framework to build agents, operate environments, and create benchmarks to evaluate them, featuring three key components: cross-environment support, a graph evaluator, and task generation. We present CRAB Benchmark-v0, developed with the CRAB framework, which includes 100 tasks across 2 environments (Ubuntu and Android), tested with 4 different MLMs under 3 distinct communication settings.

Features

  • Cross-environments: CRAB supports multiple environments, ensuring that agents can seamlessly adapt and excel across different interfaces.
  • Graph evaluator: With fine-grained evaluation, CRAB goes beyond binary success rates to provide a detailed analysis of agent performance, highlighting strengths and pinpointing areas for improvement.
  • Task Generation: CRAB automates task creation using a graph-based method. By combining multiple sub-tasks into complex tasks, CRAB generates dynamic tasks that closely mimic real-world scenarios, saving time and reducing the effort required for manual task creation.
  • Easy-to-use: All agent operations (actions), observations, and benchmark evaluators are defined as Python functions, so adding a new environment to CRAB requires only a few lines of Python code. The benchmark configuration follows a declarative programming paradigm, making it easy to reproduce any experiment environment (a sketch follows this list).
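
As a minimal sketch of that function-centric style — the `action` decorator, `Task` dataclass, and evaluator below are illustrative assumptions made for this example, not the exact CRAB API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical decorator: in this sketch an "action" is just a plain Python
# function tagged so the framework can expose it to the agent.
def action(func: Callable) -> Callable:
    func.is_action = True
    return func

@action
def write_file(path: str, content: str) -> None:
    """Write text to a file in the Ubuntu environment (runs on the target machine)."""
    with open(path, "w") as f:
        f.write(content)

# Hypothetical evaluator: also a plain Python function returning True/False.
def plan_file_exists() -> bool:
    import os
    return os.path.isfile("/home/crab/assets/plan.md")

# Declarative-style task definition: actions, observations, and evaluators are
# all referenced by function, so adding an environment is just adding functions.
@dataclass
class Task:
    description: str
    actions: List[Callable] = field(default_factory=list)
    evaluators: List[Callable] = field(default_factory=list)

task = Task(
    description="Create a markdown plan file on the Ubuntu machine.",
    actions=[write_file],
    evaluators=[plan_file_exists],
)
```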

Demo Videos

Open "Slack" in Ubuntu, navigate to "multi-modal-benchmark" channel, summarize the last two messages, then use "Messages" app in android phone to send the to the first contact in the list.

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Tasks" app on Android, check the first incomplete task, then perform the task according to its description

Settings: OpenAI GPT-4o + Multi-agent by Functionality


Open "Calendar" app on Android, summarize all schedules today. Then, in Ubuntu, create a markdown file at "/home/crab/assets/plan.md" with each event as a checkbox bullet point using Terminal and Vim.

Settings: OpenAI GPT-4o + Single Agent


Please open the X app on my phone, search for CAMEL-AI.org, check the latest post, summarize it, and then send the summary to Tianqi Xu on Slack from my PC.

Settings: OpenAI GPT-4o + Single Agent

Demo videos are edited for a better viewing experience. In actual execution, there are tens of seconds of waiting time between each step.

Related Works

We compare CRAB with existing GUI agents and benchmarks.

The columns of the comparison table detail key features of each framework:

  • Interactive Environment indicates whether the benchmark provides interactive environments or only static datasets.
  • Multimodal Observation specifies the availability of vision-based observations (e.g., screenshots).
  • Cross-platform denotes support for multiple operating systems or platforms.
  • Evaluation describes the evaluation metrics, categorized as:
    • Goal-based: checking the environment state solely against the final goal.
    • Trajectory-based: comparing the agent's action trajectory with a gold action sequence.
    • Multiple: varied across tasks.
    • Graph-based: a DAG with each node as an intermediate checkpoint.
  • Task Construction shows the task construction method, including:
    • Handmade: handcrafted by humans.
    • LLM-inspired: using an LLM to generate task drafts, which are then verified and annotated by humans.
    • Template: generated by filling in the blanks in task templates.
    • Sub-task Composition: composing multiple sub-tasks to construct tasks and their evaluators (a sketch follows this list).
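
Below is a minimal, self-contained sketch of how a graph-based evaluator built from composed sub-tasks might look. The `Checkpoint` class and `evaluate` function are assumptions for illustration, not the actual CRAB implementation; the point is only that each DAG node is an intermediate checkpoint implemented as a Python function, and composing sub-tasks means wiring their checkpoint graphs together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Checkpoint:
    name: str
    check: Callable[[dict], bool]            # inspects a (mocked) environment state
    predecessors: List[str] = field(default_factory=list)

def evaluate(checkpoints: Dict[str, Checkpoint], state: dict) -> Dict[str, bool]:
    """Walk the DAG in dependency order; a node can pass only after its predecessors."""
    passed: Dict[str, bool] = {}
    for name, cp in checkpoints.items():      # assumes insertion order is topological
        deps_ok = all(passed.get(p, False) for p in cp.predecessors)
        passed[name] = deps_ok and cp.check(state)
    return passed

# Composing two sub-tasks ("open the app", then "send a message") into one task DAG:
task = {
    "app_opened":   Checkpoint("app_opened",   lambda s: s.get("app") == "Messages"),
    "message_sent": Checkpoint("message_sent", lambda s: s.get("sent", False),
                               predecessors=["app_opened"]),
}

print(evaluate(task, {"app": "Messages", "sent": False}))
# {'app_opened': True, 'message_sent': False} -> two checkpoints, one satisfied
```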

Experiment Results

  1. Diverse Performance Across Models:
    • GPT-4o has the highest success rate and completion ratio overall.
    • GPT-4 TURBO demonstrates improved cost efficiency (CE) compared to other models.
    • Gemini 1.5 Pro and Claude 3 Opus struggle more and can hardly complete the tasks.
  2. Completion Ratios Reflect True Agent Capability:
    • Completion Ratio (CR), computed by the graph evaluator, provides a more detailed view of performance. In single-agent mode, GPT-4o's CR outperforms Gemini 1.5 Pro and Claude 3 Opus (a small sketch follows this list).
  3. Efficiency Metrics Highlight Strengths and Weaknesses:
    • GPT-4 TURBO excels in cost efficiency in single-agent mode, demonstrating cost-effective performance.
    • GPT-4o balances efficiency with performance, especially in single-agent mode.
    • Gemini 1.5 Pro shows lower efficiency and incomplete CE metrics, mainly due to its low completion ratio.
  4. Termination Reasons Indicate Areas for Improvement:
    • High Reach Step Limit (RSL) percentages across all models suggest agents often run out of steps and don't achieve the final goals of the tasks.
    • Invalid Action (IA) rates are high for Gemini 1.5 Pro, highlighting issues with generating actions in the correct format.
    • False Completion (FC) rates in multi-agent settings are higher than those in single-agent settings, indicating that message loss occurs during communication among multiple agents.
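
For concreteness, here is a tiny sketch of how CR could be read off a checkpoint DAG, assuming (our reading for illustration, not a quote of the paper's exact formula) that CR is the fraction of intermediate checkpoints satisfied when the episode ends:

```python
from typing import Dict

def completion_ratio(passed: Dict[str, bool]) -> float:
    """Fraction of checkpoint nodes satisfied at the end of an episode (0.0 if empty)."""
    return sum(passed.values()) / len(passed) if passed else 0.0

def is_success(passed: Dict[str, bool]) -> bool:
    """Count a task as a full success only when every checkpoint is satisfied."""
    return bool(passed) and all(passed.values())

# Reusing the example result from the evaluator sketch above:
passed = {"app_opened": True, "message_sent": False}
print(completion_ratio(passed), is_success(passed))  # 0.5 False
```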

BibTeX

@misc{xu2024crab,
      title={CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents}, 
      author={Tianqi Xu and Linyao Chen and Dai-Jie Wu and Yanjun Chen and Zecheng Zhang and Xiang Yao and Zhiqiang Xie and Yongchao Chen and Shilong Liu and Bochen Qian and Philip Torr and Bernard Ghanem and Guohao Li},
      year={2024},
      eprint={2407.01511},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.01511}, 
}