# bananalyzer

Web task evaluator

A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites.

Open source AI Agent evaluation framework for web tasks 🐒🍌
- 267 stars
- 2 watching
- 21 forks
- Language: Python
- Last commit: 22 days ago
- Linked from 1 awesome list
## Related projects

| Repository | Description | Stars |
|---|---|---|
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning. | 437 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| rlancemartin/auto-evaluator | An evaluation tool for question-answering systems using large language models and natural language processing techniques. | 1,063 |
| saucepleez/taskt | A process automation tool that allows users to design and execute rule-based automation without writing application code. | 1,110 |
| allenai/olmo-eval | An evaluation framework for large language models. | 311 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods. | 528 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 188 |
| gomate-community/rageval | An evaluation tool for Retrieval-Augmented Generation methods. | 132 |
| emrekavur/chaos-evaluation | Evaluates segmentation performance in medical imaging using multiple metrics. | 57 |
| corca-ai/eval | A tool that utilizes AI and automation to execute complex tasks and generate code in response to user requests. | 869 |
| amazon-science/ragchecker | An automated evaluation framework for assessing and diagnosing Retrieval-Augmented Generation systems. | 552 |
| michaelgena/rebby | A JavaScript-based tool for automating repetitive tasks in software development. | 1 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers. | 483 |
| aliostad/superbenchmarker | A performance testing tool for web applications and APIs. | 572 |