AgentBench
Agent evaluation platform
A benchmark suite for evaluating the ability of large language models (LLMs) to operate as autonomous agents in a variety of interactive environments (a minimal sketch of this evaluation loop follows the metadata below)
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
2k stars
28 watching
167 forks
Language: Python
Last commit: 2 months ago
Linked from 1 awesome list
Topics: chatgpt, gpt-4, llm, llm-agent
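Below is a minimal, self-contained sketch of the agent-evaluation loop that benchmarks like AgentBench formalize: an environment emits observations, an agent (normally an LLM) returns actions, and a harness scores the rollouts. All names here (`Environment`, `EchoAgent`, `run_episode`) are hypothetical illustrations, not AgentBench's actual API.

```python
"""Hypothetical sketch of an agent-evaluation loop; not AgentBench's API."""

from dataclasses import dataclass, field


@dataclass
class Environment:
    """A toy text environment: the agent must reproduce the target string."""
    target: str
    max_turns: int = 3
    turns: int = field(default=0, init=False)

    def reset(self) -> str:
        self.turns = 0
        return f"Repeat exactly: {self.target}"

    def step(self, action: str) -> tuple[str, float, bool]:
        """Return (observation, reward, done) for one agent action."""
        self.turns += 1
        if action.strip() == self.target:
            return "correct", 1.0, True
        return "incorrect, try again", 0.0, self.turns >= self.max_turns


class EchoAgent:
    """Stand-in for an LLM: parses the instruction instead of calling a model."""

    def act(self, observation: str) -> str:
        # A real harness would prompt the model with the observation here.
        prefix = "Repeat exactly: "
        return observation[len(prefix):] if observation.startswith(prefix) else ""


def run_episode(env: Environment, agent: EchoAgent) -> float:
    """Roll out one episode and return the final reward."""
    obs = env.reset()
    reward, done = 0.0, False
    while not done:
        obs, reward, done = env.step(agent.act(obs))
    return reward


if __name__ == "__main__":
    tasks = [Environment(target=t) for t in ("ls -la", "SELECT 1;", "hello")]
    rate = sum(run_episode(env, EchoAgent()) for env in tasks) / len(tasks)
    print(f"success rate: {rate:.2f}")  # 1.00 for this toy agent
```

AgentBench itself runs many such environments (operating system, database, web browsing, among others) and aggregates per-task scores; see the repository for its actual task configs and entry points.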
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| agenta-ai/agenta | An end-to-end platform for building and deploying large language model (LLM) applications | 1,624 |
| opengvlab/lamm | A framework and benchmark for training and evaluating multi-modal LLMs, geared toward AI agents that interact with humans and machines | 305 |
| damo-nlp-sg/m3exam | A benchmark for evaluating LLMs across multiple languages and formats | 93 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to expose failure modes of large vision-language models and help improve their accuracy | 259 |
| melih-unsal/demogpt | A comprehensive toolset for building LLM-based applications | 1,733 |
| qcri/llmebench | A benchmarking framework for LLMs | 81 |
| mpaepper/llm_agents | Builds LLM-controlled agents that perform tasks using tool-based components | 940 |
| freedomintelligence/mllm-bench | Evaluates and compares multimodal LLMs across a range of tasks | 56 |
| maximilian-winter/llama-cpp-agent | A tool for interacting with LLMs to execute structured function calls and generate structured output (see the sketch after this table) | 505 |
| allenai/reward-bench | A benchmarking framework for evaluating the performance and safety of reward models used in RLHF (reinforcement learning from human feedback) | 459 |
| internlm/lagent | A lightweight framework for building LLM-based agent applications | 1,924 |
| open-compass/lawbench | Evaluates the legal knowledge of LLMs with a custom benchmarking framework | 273 |
| thudm/longalign | A framework for training and evaluating LLMs on long-context inputs | 230 |
| multimodal-art-projection/omnibench | Evaluates multimodal language models' ability to process visual, acoustic, and textual inputs simultaneously | 15 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
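The structured function calling mentioned in the maximilian-winter/llama-cpp-agent row follows a common pattern: the model is constrained to emit JSON naming a function and its arguments, which the harness validates and dispatches. The stdlib-only sketch below illustrates that pattern with a hard-coded model reply; the tool names and payload shape are assumptions for illustration, not llama-cpp-agent's API.

```python
"""Stdlib-only sketch of the structured function-calling pattern.
The model reply is hard-coded; a real harness would obtain it from
an LLM constrained to emit JSON."""

import json
from typing import Callable

# Registry of callable tools; names and signatures are illustrative only.
TOOLS: dict[str, Callable[..., object]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def dispatch(raw_reply: str) -> object:
    """Parse a JSON tool call {"function": ..., "arguments": {...}} and run it."""
    call = json.loads(raw_reply)
    fn = TOOLS.get(call["function"])
    if fn is None:
        raise ValueError(f"unknown function: {call['function']}")
    return fn(**call["arguments"])


if __name__ == "__main__":
    # Stand-in for a constrained model response.
    reply = '{"function": "add", "arguments": {"a": 2, "b": 40}}'
    print(dispatch(reply))  # 42
```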