auto-evaluator
QA evaluator
An evaluation tool for question-answering systems using large language models and natural language processing techniques
Evaluation tool for LLM QA chains
1k stars
8 watching
95 forks
Language: Python
Last commit: over 1 year ago
Linked from 2 awesome lists
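Tools in this space typically follow the LLM-as-judge pattern: a QA chain produces answers, and a grading model is prompted to score each answer against a reference. The sketch below illustrates that general pattern only; it is not taken from this repository, and the grading prompt, model name, and `grade_answer` helper are illustrative assumptions.

```python
# Generic LLM-as-judge grading sketch (illustrative, not this repository's API).
# Assumes the `openai` Python package (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Student answer: {answer}
Reply with a single word: CORRECT or INCORRECT."""

def grade_answer(question: str, reference: str, answer: str) -> bool:
    """Ask a grading model whether `answer` matches the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of grading model
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")

# Example: score a small QA dataset produced by any QA chain.
dataset = [
    {"question": "What does the tool measure?",
     "reference": "Answer quality of LLM QA chains.",
     "answer": "It grades the answers produced by a QA chain."},
]
scores = [grade_answer(**row) for row in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2f}")
```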
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| langchain-ai/auto-evaluator | Automated evaluation of language models on question-answering tasks | 744 |
| allenai/olmo-eval | An evaluation framework for large language models | 310 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters | 558 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts used in large language models (LLMs) based on perplexity estimation | 38 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,526 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmark tests | 1,939 |
| gomate-community/rageval | An evaluation tool for retrieval-augmented generation methods | 132 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods | 528 |
| reworkd/bananalyzer | A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites | 267 |
| evolvinglmms-lab/lmms-eval | Tools and an evaluation suite for large multimodal models | 2,058 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 188 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers | 483 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline | 22 |
| allenai/document-qa | Tools and codebase for training neural question-answering models on multiple paragraphs of text | 434 |
| allenai/reward-bench | A comprehensive benchmark for evaluating the performance and safety of reward models in reinforcement learning | 429 |