auto-evaluator

QA evaluator

An evaluation tool for question-answering (QA) systems built with large language models and natural language processing techniques; it evaluates LLM-based QA chains.
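The core idea behind a tool like this is to run a QA chain over a set of question/answer pairs and have a second LLM grade each predicted answer against the reference. The sketch below illustrates that loop in Python; it is not this repository's code, and `qa_chain`, `grade_fn`, and the grading prompt are hypothetical stand-ins for whatever model calls you plug in.

```python
# Minimal, illustrative sketch of LLM-graded QA evaluation.
# Assumes you supply your own `qa_chain` (question -> predicted answer) and
# `grade_fn` (grading prompt -> grader reply) callables; names and prompt are hypothetical.
from typing import Callable, Dict, List

GRADER_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Student answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""

def evaluate_qa_chain(
    examples: List[Dict[str, str]],      # each item: {"question": ..., "answer": ...}
    qa_chain: Callable[[str], str],      # system under test
    grade_fn: Callable[[str], str],      # LLM call that returns the grader's reply
) -> float:
    """Return the fraction of examples the grader marks CORRECT."""
    correct = 0
    for ex in examples:
        prediction = qa_chain(ex["question"])
        verdict = grade_fn(GRADER_PROMPT.format(
            question=ex["question"],
            reference=ex["answer"],
            prediction=prediction,
        ))
        if verdict.strip().upper().startswith("CORRECT"):
            correct += 1
    return correct / len(examples) if examples else 0.0
```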

GitHub

1k stars
8 watching
95 forks
Language: Python
Last commit: over 1 year ago
Linked from 2 awesome lists


Related projects:

| Repository | Description | Stars |
|---|---|---|
| langchain-ai/auto-evaluator | Automated evaluation of language models for question answering tasks | 749 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters | 566 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts used in large language models (LLMs) based on perplexity estimation | 43 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,568 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
| gomate-community/rageval | An evaluation tool for retrieval-augmented generation methods | 141 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535 |
| reworkd/bananalyzer | A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites | 274 |
| evolvinglmms-lab/lmms-eval | Tools and evaluation framework for accelerating the development of large multimodal models by providing an efficient way to assess their performance | 2,164 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments offering automated metrics and efficient computation | 187 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers | 499 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models by providing rigorous metrics and an efficient evaluation pipeline | 22 |
| allenai/document-qa | Tools and codebase for training neural question answering models on multiple paragraphs of text data | 435 |
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning | 459 |