# bananalyzer

Web task evaluator

A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites.

Open source AI Agent evaluation framework for web tasks 🐒🍌
- 267 stars
- 2 watching
- 21 forks
- Language: Python
- Last commit: 22 days ago
- Linked from 1 awesome list
## Related projects

| Repository | Description | Stars |
|---|---|---|
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning. | 437 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| rlancemartin/auto-evaluator | An evaluation tool for question-answering systems using large language models and natural language processing techniques. | 1,063 |
| saucepleez/taskt | A process automation tool that allows users to design and execute rule-based automation without writing application code. | 1,110 |
| allenai/olmo-eval | An evaluation framework for large language models. | 311 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods. | 528 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 188 |
| gomate-community/rageval | An evaluation tool for Retrieval-Augmented Generation methods. | 132 |
| emrekavur/chaos-evaluation | Evaluates segmentation performance in medical imaging using multiple metrics. | 57 |
| corca-ai/eval | A tool that utilizes AI and automation to execute complex tasks and generate code in response to user requests. | 869 |
| amazon-science/ragchecker | An automated evaluation framework for assessing and diagnosing Retrieval-Augmented Generation systems. | 552 |
| michaelgena/rebby | A JavaScript-based tool for automating repetitive tasks in software development. | 1 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers. | 483 |
| aliostad/superbenchmarker | A performance testing tool for web applications and APIs. | 572 |