reward-bench
Reward model evaluator
A comprehensive benchmarking framework for evaluating the performance and safety of reward models used in reinforcement learning from human feedback (RLHF).
RewardBench: the first evaluation tool for reward models.
459 stars
5 watching
54 forks
Language: Python
Last commit: 3 months ago
Linked from 1 awesome list
preference-learning, rlhf
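The headline metric in reward-model benchmarks like this one is pairwise accuracy on preference data: a reward model passes a comparison when it scores the human-preferred ("chosen") completion above the rejected one. The sketch below illustrates that check with an off-the-shelf Hugging Face reward model; the model name and the toy preference pairs are placeholders for illustration, not part of this repository's API.

```python
# Minimal sketch of the core check a reward-model benchmark performs:
# the model should score the "chosen" completion above the rejected one.
# The model name and example pairs below are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example public RM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

pairs = [
    # (prompt, chosen completion, rejected completion) -- toy examples
    ("What is 2 + 2?", "2 + 2 equals 4.", "2 + 2 equals 5."),
    ("Name a primary color.", "Red is a primary color.", "Grass is green."),
]

def score(prompt: str, completion: str) -> float:
    """Return the scalar reward the model assigns to a prompt/completion pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

correct = sum(score(p, chosen) > score(p, rejected) for p, chosen, rejected in pairs)
print(f"pairwise accuracy: {correct / len(pairs):.2f}")
```

RewardBench itself runs this kind of comparison over curated subsets (chat, chat-hard, safety, reasoning) and reports per-category accuracies rather than a single aggregate over ad hoc pairs.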
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
|  | A framework for evaluating language models on NLP tasks | 326 |
|  | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
|  | A platform for comparing and evaluating AI and machine learning algorithms at scale | 1,779 |
|  | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance | 2,063 |
|  | A benchmark suite for evaluating the ability of large language models to operate as autonomous agents in various environments | 2,272 |
|  | A benchmarking framework for evaluating Large Multimodal Models with rigorous metrics and an efficient evaluation pipeline | 22 |
|  | An eXplainability toolbox for machine learning that enables data analysis and model evaluation to mitigate biases and improve performance | 1,135 |
|  | A software framework for organizing and running machine learning experiments with Python | 533 |
|  | A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites | 274 |
|  | A standardized benchmark for measuring the robustness of machine learning models against adversarial attacks | 682 |
|  | An implementation of an actor-critic reinforcement learning algorithm in Python | 245 |
|  | A benchmark suite for unsupervised reinforcement learning agents, providing pre-trained models and scripts for testing and fine-tuning agent performance | 335 |
|  | A toolset for evaluating and comparing natural language generation models | 1,350 |
|  | A framework for evaluating OpenAI models and an open-source registry of benchmarks | 19 |
|  | Tools for comparing and benchmarking small code snippets | 514 |