jury
NLP evaluator
A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation.
Comprehensive NLP Evaluation System
187 stars
5 watching
20 forks
Language: Python
Last commit: 6 months ago
Linked from 1 awesome list
Topics: datasets, evaluate, evaluation, huggingface, machine-learning, metrics, natural-language-processing, nlp, nlp-evaluation, python, pytorch, transformers
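Below is a minimal usage sketch of the jury scorer, based on its documented quick-start pattern; the default metric bundle, available metric names, and exact call signature may differ between versions, so treat it as illustrative rather than authoritative.

```python
# Minimal sketch (assumes `pip install jury`); check the project README for
# the metrics and arguments supported by your installed version.
from jury import Jury

# Each prediction may contain one or more candidate texts, and each reference
# may contain one or more acceptable ground truths.
predictions = [
    ["the cat is on the mat", "there is a cat on the mat"],
    ["a very sunny day"],
]
references = [
    ["the cat is playing on the mat"],
    ["today is a very sunny day", "it is sunny today"],
]

# Default construction computes a standard bundle of NLG metrics; a specific
# set can be requested, e.g. Jury(metrics=["bleu", "rouge"]).
scorer = Jury()
scores = scorer(predictions=predictions, references=references)
print(scores)  # dict mapping metric names to computed values
```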
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,350 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,063 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks. | 326 |
| nullne/evaluator | An expression evaluator library written in Go. | 41 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
| olical/conjure | An interactive environment for evaluating code within a running program. | 1,806 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework. | 273 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,568 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 85 |
| ermlab/polish-word-embeddings-review | An evaluation framework for Polish word embeddings prepared by various research groups, using analogy tasks. | 4 |
| huggingface/lighteval | An all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends. | 879 |
| lartpang/pysodevaltoolkit | A comprehensive Python toolbox for evaluating salient object detection and camouflaged object detection tasks. | 168 |
| eddieantonio/ocreval | A collection of tools and utilities for evaluating the performance and quality of OCR output. | 57 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,650 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts used in Large Language Models (LLMs) based on perplexity estimation. | 43 |