AgentBench

Agent evaluation platform

A benchmark suite for evaluating the ability of large language models to operate as autonomous agents in various environments

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

GitHub

2k stars
28 watching
159 forks
Language: Python
last commit: 9 days ago
Linked from 1 awesome list

Tags: chatgpt, gpt-4, llm, llm-agent

Related projects:

Repository | Description | Stars
agenta-ai/agenta | A developer platform for building and deploying large language models | 1,275
opengvlab/lamm | A framework and benchmark for training and evaluating multi-modal large language models, enabling the development of AI agents capable of seamless interaction between humans and machines | 301
damo-nlp-sg/m3exam | A benchmark for evaluating large language models in multiple languages and formats | 92
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243
melih-unsal/demogpt | A comprehensive toolset for building Large Language Model (LLM)-based applications | 1,710
qcri/llmebench | A benchmarking framework for large language models | 80
mpaepper/llm_agents | Builds agents controlled by large language models (LLMs) to perform tasks with tool-based components | 931
freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 55
maximilian-winter/llama-cpp-agent | A tool for easy interaction with Large Language Models (LLMs) to execute structured function calls and generate structured output | 493
allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning | 429
internlm/lagent | A lightweight framework for building agent-based applications using LLMs and transformer architectures | 1,865
open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework | 267
thudm/longalign | A framework for training and evaluating large language models on long context inputs | 217
multimodal-art-projection/omnibench | Evaluates and benchmarks multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 14
allenai/olmo-eval | An evaluation framework for large language models | 310