bananalyzer
Open source AI agent evaluation framework for web tasks 🐒🍌
Evaluates AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites.
274 stars · 3 watching · 21 forks
Language: Python
Last commit: about 2 months ago
Linked from 1 awesome list
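The core idea described above — run an agent against a fixed set of example websites and score its outputs against known-good answers — can be sketched in a few lines. This is a hypothetical illustration of the general pattern, not bananalyzer's actual API; the `Example`, `evaluate`, and `toy_agent` names are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    """A predefined example website paired with the expected result (hypothetical)."""
    url: str
    goal: str
    expected: dict

def evaluate(agent: Callable[[str, str], dict], examples: list[Example]) -> float:
    """Run the agent on each example and return the fraction of exact matches."""
    passed = 0
    for ex in examples:
        result = agent(ex.url, ex.goal)  # agent receives the site URL and the task goal
        if result == ex.expected:
            passed += 1
    return passed / len(examples) if examples else 0.0

# Toy agent that "knows" one site, for demonstration only.
def toy_agent(url: str, goal: str) -> dict:
    return {"title": "Example Domain"} if "example.com" in url else {}

examples = [
    Example("https://example.com", "extract the page title", {"title": "Example Domain"}),
    Example("https://example.org", "extract the page title", {"title": "Example Domain"}),
]
print(evaluate(toy_agent, examples))  # → 0.5
```

A real harness like bananalyzer adds more on top of this skeleton (browser automation, fuzzy matching, test-suite generation), but the evaluate-against-predefined-examples loop is the essential shape.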
Related projects:
| Repository | Description | Stars |
|---|---|---|
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning | 459 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,568 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
| rlancemartin/auto-evaluator | An evaluation tool for question-answering systems using large language models and natural language processing techniques | 1,065 |
| saucepleez/taskt | A process automation tool that allows users to design and execute rule-based automation without writing application code | 1,125 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 187 |
| gomate-community/rageval | An evaluation tool for retrieval-augmented generation methods | 141 |
| emrekavur/chaos-evaluation | Evaluates segmentation performance in medical imaging using multiple metrics | 57 |
| corca-ai/eval | A tool that uses AI and automation to execute complex tasks and generate code in response to user requests | 869 |
| amazon-science/ragchecker | A framework for evaluating and diagnosing retrieval-augmented generation systems | 630 |
| michaelgena/rebby | A JavaScript-based tool for automating repetitive tasks in software development | 1 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers | 499 |
| aliostad/superbenchmarker | A performance testing tool for web applications and APIs | 572 |