bananalyzer

Web task evaluator

A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites.

Open source AI Agent evaluation framework for web tasks 🐒🍌
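The idea of dynamically constructing a test suite from predefined examples can be sketched roughly as follows. This is a minimal illustration built on Python's standard `unittest` module, not bananalyzer's actual API: the `Example` class, the `Agent` callable type, and the `build_suite` and `run_suite` helpers are all hypothetical names introduced here, and the URLs are placeholders.

```python
import unittest
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Example:
    """Hypothetical record describing one web task and its expected result."""
    name: str
    url: str
    goal: str
    expected: Dict[str, str]


# Assumption: an "agent" is any callable mapping (url, goal) to extracted data.
Agent = Callable[[str, str], Dict[str, str]]


def make_test(example: Example, agent: Agent):
    """Return a test method that runs the agent on one example."""
    def test(self):
        result = agent(example.url, example.goal)
        self.assertEqual(result, example.expected)
    return test


def build_suite(examples: List[Example], agent: Agent) -> unittest.TestSuite:
    """Create a TestCase subclass at runtime with one method per example."""
    methods = {f"test_{ex.name}": make_test(ex, agent) for ex in examples}
    case = type("GeneratedWebTasks", (unittest.TestCase,), methods)
    return unittest.defaultTestLoader.loadTestsFromTestCase(case)


def run_suite(suite: unittest.TestSuite) -> unittest.TestResult:
    """Execute the suite and collect results without printing."""
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    examples = [
        Example("title", "https://example.com/a", "get the title", {"title": "A"}),
        Example("price", "https://example.com/b", "get the price", {"price": "9.99"}),
    ]

    # Stand-in agent with canned answers; a real agent would drive a browser.
    def fake_agent(url: str, goal: str) -> Dict[str, str]:
        return {"title": "A"} if url.endswith("/a") else {"price": "0.00"}

    outcome = run_suite(build_suite(examples, fake_agent))
    print(f"ran={outcome.testsRun} failed={len(outcome.failures)}")
```

Generating one test method per example keeps each task's pass or fail status separate, so a single failing website does not hide the results for the others.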

GitHub

267 stars
2 watching
21 forks
Language: Python
last commit: 22 days ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning. | 437 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| rlancemartin/auto-evaluator | An evaluation tool for question-answering systems using large language models and natural language processing techniques. | 1,063 |
| saucepleez/taskt | A process automation tool that allows users to design and execute rule-based automation without writing application code. | 1,110 |
| allenai/olmo-eval | An evaluation framework for large language models. | 311 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods. | 528 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments offering automated metrics and efficient computation. | 188 |
| gomate-community/rageval | An evaluation tool for Retrieval-Augmented Generation methods. | 132 |
| emrekavur/chaos-evaluation | Evaluates segmentation performance in medical imaging using multiple metrics. | 57 |
| corca-ai/eval | A tool that utilizes AI and automation to execute complex tasks and generate code in response to user requests. | 869 |
| amazon-science/ragchecker | An automated evaluation framework for assessing and diagnosing Retrieval-Augmented Generation systems. | 552 |
| michaelgena/rebby | A JavaScript-based tool for automating repetitive tasks in software development. | 1 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers. | 483 |
| aliostad/superbenchmarker | A performance testing tool for web applications and APIs. | 572 |