jury
NLP evaluator
A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation (see the usage sketch below).
Comprehensive NLP Evaluation System
188 stars
5 watching
20 forks
Language: Python
Last commit: 4 months ago
Linked from 1 awesome list
Topics: datasets, evaluate, evaluation, huggingface, machine-learning, metrics, natural-language-processing, nlp, nlp-evaluation, python, pytorch, transformers
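The snippet below is a minimal usage sketch based on the description above. The `Jury` class, its callable scoring interface, and the metric names follow the conventions shown in the project README; treat the exact parameter names and the shape of the returned scores as assumptions rather than a definitive API reference.

```python
# Minimal sketch: scoring model outputs with jury.
# Assumptions: Jury is importable from the `jury` package, accepts a
# `metrics` list, and is callable with `predictions` and `references`.
from jury import Jury

predictions = ["the cat is on the mat", "Look, a wonderful day."]
references = ["the cat is playing on the mat.", "Today is a wonderful day."]

# Metric names are resolved to the underlying Hugging Face `evaluate` metrics.
scorer = Jury(metrics=["bleu", "meteor", "rouge"])
scores = scorer(predictions=predictions, references=references)
print(scores)  # expected: a dict keyed by metric name (assumption)
```

Keeping predictions and references as parallel lists and selecting metrics by name mirrors the Hugging Face `evaluate` convention, which the huggingface/evaluate entry in the table below also follows.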
Related projects:
Repository | Description | Stars |
---|---|---|
maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,347 |
huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance (see the sketch after this table). | 2,034 |
allenai/olmo-eval | An evaluation framework for large language models. | 310 |
nullne/evaluator | An expression evaluator library written in Go. | 41 |
openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
olical/conjure | An interactive environment for evaluating code within a running program. | 1,785 |
open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework. | 267 |
tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 75 |
ermlab/polish-word-embeddings-review | An analogy-task evaluation framework for Polish word embeddings prepared by various research groups. | 4 |
huggingface/lighteval | A toolkit for evaluating large language models across multiple backends. | 804 |
lartpang/pysodevaltoolkit | A comprehensive Python toolbox for evaluating salient object detection and camouflaged object detection tasks. | 167 |
eddieantonio/ocreval | A collection of tools and utilities for evaluating the performance and quality of OCR output. | 57 |
hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,636 |
krrishdholakia/betterprompt | An API for evaluating the quality of text prompts for Large Language Models (LLMs) based on perplexity estimation. | 38 |
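For comparison with the huggingface/evaluate entry above, here is a short sketch of that library's standardized metric interface used directly. `evaluate.load` and the metric's `compute` method are part of its public API; the specific metric and inputs are purely illustrative.

```python
# Sketch: computing a single standardized metric with huggingface/evaluate.
import evaluate

bleu = evaluate.load("bleu")  # load the BLEU metric implementation
results = bleu.compute(
    predictions=["the cat is on the mat"],           # one hypothesis per example
    references=[["the cat is playing on the mat."]], # list of references per example
)
print(results["bleu"])  # corpus-level BLEU score
```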