auto-evaluator
Evaluation tool for LLM QA chains: assesses question-answering systems built with large language models, using natural language processing techniques.
1k stars
8 watching
95 forks
Language: Python
Last commit: over 1 year ago
Linked from 2 awesome lists
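Since the tool's job is grading LLM QA chains against reference answers, a minimal sketch of that style of evaluation is shown below. It assumes LangChain's QAEvalChain and an OpenAI chat model; the model name, example data, and import paths are illustrative assumptions rather than the project's own code.

```python
# Rough sketch of LLM-graded QA evaluation, in the spirit of auto-evaluator.
# Illustrative only: model name, data, and import paths (which vary by
# LangChain version) are assumptions, not the project's exact code.
from langchain_openai import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# Reference question/answer pairs (the eval set).
examples = [
    {"question": "What does auto-evaluator assess?",
     "answer": "LLM question-answering chains."},
]

# Predictions produced by the QA chain under test (hard-coded here for brevity).
predictions = [
    {"result": "It evaluates question-answering chains built on large language models."},
]

# A grading LLM compares each prediction against its reference answer.
grader = ChatOpenAI(model="gpt-4o-mini", temperature=0)
eval_chain = QAEvalChain.from_llm(grader)
graded = eval_chain.evaluate(
    examples,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)

for example, grade in zip(examples, graded):
    # Each grade is a small dict holding the grader's verdict, e.g. CORRECT / INCORRECT.
    print(example["question"], "->", grade)
```

In practice the per-example verdicts would be aggregated into an overall score for the chain under test.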
Related projects:
| Repository | Description | Stars |
|---|---|---|
| langchain-ai/auto-evaluator | Automated evaluation of language models for question answering tasks | 749 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks | 326 |
| mlabonne/llm-autoeval | A tool to automate the evaluation of large language models in Google Colab using various benchmarks and custom parameters | 566 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts for large language models (LLMs) based on perplexity estimation | 43 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models | 1,568 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques | 2,059 |
| gomate-community/rageval | An evaluation tool for retrieval-augmented generation (RAG) methods | 141 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535 |
| reworkd/bananalyzer | A tool to evaluate AI agents on web tasks by dynamically constructing and executing test suites against predefined example websites | 274 |
| evolvinglmms-lab/lmms-eval | An evaluation framework for large multimodal models, providing an efficient way to assess their performance | 2,164 |
| obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation | 187 |
| stanford-futuredata/ares | A tool for automatically evaluating RAG models by generating synthetic data and fine-tuning classifiers | 499 |
| chenllliang/mmevalpro | A benchmarking framework for evaluating large multimodal models with rigorous metrics and an efficient evaluation pipeline | 22 |
| allenai/document-qa | Tools and codebase for training neural question answering models on multiple paragraphs of text | 435 |
| allenai/reward-bench | A comprehensive benchmarking framework for evaluating the performance and safety of reward models in reinforcement learning | 459 |