CRAB: Cross-environment Agent Benchmark
CRAB is a benchmark framework designed for evaluating multimodal language model (MLM) agents across diverse environments. It offers tools for building agents, operating environments, and creating benchmarks.
Key Features:
- Cross-environment Support: Agents can carry out tasks that span multiple environments, such as Ubuntu and Android.
- Graph Evaluator: Provides fine-grained performance analysis beyond simple success rates.
- Task Generation: Automates task creation using a graph-based method, mimicking real-world scenarios.
- Easy to Use: Agent operations, observations, and benchmark evaluators are all defined in Python.
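To illustrate how a graph evaluator can report more than a pass/fail result, here is a minimal sketch. It assumes a task is broken into checkpoints arranged in a directed acyclic graph, and that a checkpoint only counts once its prerequisites have been reached. The function `completion_ratio`, the checkpoint names, and the `state` dictionary are all hypothetical and are not taken from CRAB's actual API.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Hypothetical names throughout; this is not CRAB's actual API.
Checker = Callable[[dict], bool]


def completion_ratio(
    checkers: Dict[str, Checker],
    prerequisites: Dict[str, Set[str]],
    state: dict,
) -> float:
    """Fraction of checkpoints reached, respecting prerequisite order."""
    graph = {name: prerequisites.get(name, set()) for name in checkers}
    done: Set[str] = set()
    # static_order() yields each checkpoint after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        if graph[name] <= done and checkers[name](state):
            done.add(name)
    return len(done) / len(checkers)


# Toy cross-environment task: take a photo on the phone,
# copy it to the desktop, then open it there.
checkers = {
    "photo_taken": lambda s: s.get("phone_photo", False),
    "photo_copied": lambda s: s.get("desktop_file", False),
    "photo_opened": lambda s: s.get("viewer_open", False),
}
prerequisites = {
    "photo_copied": {"photo_taken"},
    "photo_opened": {"photo_copied"},
}

# The agent finished two of the three steps.
partial_state = {"phone_photo": True, "desktop_file": True, "viewer_open": False}
ratio = completion_ratio(checkers, prerequisites, partial_state)
success = ratio == 1.0
```

A plain success rate would score this run as a failure, while the ratio records that two of the three checkpoints were reached, which is the kind of partial credit a graph-based evaluation makes visible.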
Use Cases:
- Evaluating and comparing the performance of different MLM agents.
- Developing agents that can operate across multiple platforms.
- Analyzing agent strengths and weaknesses through detailed performance metrics.
- Automating the creation of complex, real-world tasks for agent training and evaluation.
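One way to picture graph-based task creation is as the composition of small sub-tasks whose outputs feed the inputs of later ones. The sketch below assumes that model and restricts itself to linear chains, the simplest kind of graph. The `SubTask` fields, the `compose` function, and the sample sub-tasks are hypothetical illustrations and do not reflect CRAB's real schema.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SubTask:
    # Hypothetical template; the fields are illustrative, not CRAB's schema.
    name: str
    environment: str
    needs: Optional[str]     # artifact consumed; None if it can start a task
    produces: Optional[str]  # artifact produced; None if it ends a task


def compose(subtasks: List[SubTask], length: int) -> List[Tuple[SubTask, ...]]:
    """Enumerate chains where each step consumes what the previous one produced."""
    chains = []
    for chain in permutations(subtasks, length):
        starts_clean = chain[0].needs is None
        linked = all(
            a.produces is not None and a.produces == b.needs
            for a, b in zip(chain, chain[1:])
        )
        if starts_clean and linked:
            chains.append(chain)
    return chains


pool = [
    SubTask("take a photo", "android", None, "image"),
    SubTask("download a paper", "ubuntu", None, "pdf"),
    SubTask("send the file to the desktop", "android", "image", "image_on_desktop"),
    SubTask("set the image as wallpaper", "ubuntu", "image_on_desktop", None),
]

tasks = compose(pool, 3)
descriptions = [" -> ".join(step.name for step in chain) for chain in tasks]
```

With this pool, only one three-step chain links up end to end, and it crosses from Android to Ubuntu. Adding more sub-task templates grows the set of generated tasks without writing each composite task by hand.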
CRAB Benchmark-v0 includes 120 tasks across Ubuntu and Android environments, tested with 6 MLMs under 3 communication settings.