AgentBench
Agent evaluation platform
A benchmark suite for evaluating the ability of large language models (LLMs) to operate as autonomous agents in a variety of interactive environments (a minimal sketch of this evaluation loop follows the metadata below)
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
2k stars
28 watching
167 forks
Language: Python
Last commit: 2 months ago
Linked from 1 awesome list
Topics: chatgpt, gpt-4, llm, llm-agent
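Below is a minimal, self-contained sketch of the agent-evaluation loop that benchmarks like AgentBench formalize: an environment emits observations, an agent (normally an LLM) returns actions, and a harness scores the rollouts. All names here (`Environment`, `EchoAgent`, `run_episode`) are hypothetical illustrations, not AgentBench's actual API.

```python
"""Hypothetical sketch of an agent-evaluation loop; not AgentBench's API."""

from dataclasses import dataclass, field


@dataclass
class Environment:
    """A toy text environment: the agent must reproduce the target string."""
    target: str
    max_turns: int = 3
    turns: int = field(default=0, init=False)

    def reset(self) -> str:
        self.turns = 0
        return f"Repeat exactly: {self.target}"

    def step(self, action: str) -> tuple[str, float, bool]:
        """Return (observation, reward, done) for one agent action."""
        self.turns += 1
        if action.strip() == self.target:
            return "correct", 1.0, True
        return "incorrect, try again", 0.0, self.turns >= self.max_turns


class EchoAgent:
    """Stand-in for an LLM: parses the instruction instead of calling a model."""

    def act(self, observation: str) -> str:
        # A real harness would prompt the model with the observation here.
        prefix = "Repeat exactly: "
        return observation[len(prefix):] if observation.startswith(prefix) else ""


def run_episode(env: Environment, agent: EchoAgent) -> float:
    """Roll out one episode and return the final reward."""
    obs = env.reset()
    reward, done = 0.0, False
    while not done:
        obs, reward, done = env.step(agent.act(obs))
    return reward


if __name__ == "__main__":
    tasks = [Environment(target=t) for t in ("ls -la", "SELECT 1;", "hello")]
    rate = sum(run_episode(env, EchoAgent()) for env in tasks) / len(tasks)
    print(f"success rate: {rate:.2f}")  # 1.00 for this toy agent
```

AgentBench itself runs many such environments (operating system, database, web browsing, among others) and aggregates per-task scores; see the repository for its actual task configs and entry points.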
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| agenta-ai/agenta | An end-to-end platform for building and deploying large language model (LLM) applications | 1,624 |
| opengvlab/lamm | A framework and benchmark for training and evaluating multi-modal LLMs, geared toward AI agents that interact with humans and machines | 305 |
| damo-nlp-sg/m3exam | A benchmark for evaluating LLMs across multiple languages and formats | 93 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to expose failure modes of large vision-language models and help improve their accuracy | 259 |
| melih-unsal/demogpt | A comprehensive toolset for building LLM-based applications | 1,733 |
| qcri/llmebench | A benchmarking framework for LLMs | 81 |
| mpaepper/llm_agents | Builds LLM-controlled agents that perform tasks using tool-based components | 940 |
| freedomintelligence/mllm-bench | Evaluates and compares multimodal LLMs across a range of tasks | 56 |
| maximilian-winter/llama-cpp-agent | A tool for interacting with LLMs to execute structured function calls and generate structured output (see the sketch after this table) | 505 |
| allenai/reward-bench | A benchmarking framework for evaluating the performance and safety of reward models used in RLHF (reinforcement learning from human feedback) | 459 |
| internlm/lagent | A lightweight framework for building LLM-based agent applications | 1,924 |
| open-compass/lawbench | Evaluates the legal knowledge of LLMs with a custom benchmarking framework | 273 |
| thudm/longalign | A framework for training and evaluating LLMs on long-context inputs | 230 |
| multimodal-art-projection/omnibench | Evaluates multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 15 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
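The structured function calling mentioned in the maximilian-winter/llama-cpp-agent row follows a common pattern: the model is constrained to emit JSON naming a function and its arguments, which the harness validates and dispatches. The stdlib-only sketch below illustrates that pattern with a hard-coded model reply; the tool names and payload shape are assumptions for illustration, not llama-cpp-agent's API.

```python
"""Stdlib-only sketch of the structured function-calling pattern.
The model reply is hard-coded; a real harness would obtain it from
an LLM constrained to emit JSON."""

import json
from typing import Callable

# Registry of callable tools; names and signatures are illustrative only.
TOOLS: dict[str, Callable[..., object]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def dispatch(raw_reply: str) -> object:
    """Parse a JSON tool call {"function": ..., "arguments": {...}} and run it."""
    call = json.loads(raw_reply)
    fn = TOOLS.get(call["function"])
    if fn is None:
        raise ValueError(f"unknown function: {call['function']}")
    return fn(**call["arguments"])


if __name__ == "__main__":
    # Stand-in for a constrained model response.
    reply = '{"function": "add", "arguments": {"a": 2, "b": 40}}'
    print(dispatch(reply))  # 42
```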