jury
NLP evaluator
A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation.
Comprehensive NLP Evaluation System
187 stars
5 watching
20 forks
Language: Python
Last commit: 6 months ago
Linked from 1 awesome list
Topics: datasets, evaluate, evaluation, huggingface, machine-learning, metrics, natural-language-processing, nlp, nlp-evaluation, python, pytorch, transformers
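Below is a minimal usage sketch of the jury scorer, based on its documented quick-start pattern; the default metric bundle, available metric names, and exact call signature may differ between versions, so treat it as illustrative rather than authoritative.

```python
# Minimal sketch (assumes `pip install jury`); check the project README for
# the metrics and arguments supported by your installed version.
from jury import Jury

# Each prediction may contain one or more candidate texts, and each reference
# may contain one or more acceptable ground truths.
predictions = [
    ["the cat is on the mat", "there is a cat on the mat"],
    ["a very sunny day"],
]
references = [
    ["the cat is playing on the mat"],
    ["today is a very sunny day", "it is sunny today"],
]

# Default construction computes a standard bundle of NLG metrics; a specific
# set can be requested, e.g. Jury(metrics=["bleu", "rouge"]).
scorer = Jury()
scores = scorer(predictions=predictions, references=references)
print(scores)  # dict mapping metric names to computed values
```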
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,350 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,063 |
| allenai/olmo-eval | A framework for evaluating language models on NLP tasks. | 326 |
| nullne/evaluator | An expression evaluator library written in Go. | 41 |
| openai/simple-evals | Evaluates language models using standardized benchmarks and prompting techniques. | 2,059 |
| olical/conjure | An interactive environment for evaluating code within a running program. | 1,806 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework. | 273 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,568 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 85 |
| ermlab/polish-word-embeddings-review | An evaluation framework for Polish word embeddings prepared by various research groups, using analogy tasks. | 4 |
| huggingface/lighteval | An all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends. | 879 |
| lartpang/pysodevaltoolkit | A comprehensive Python toolbox for evaluating salient object detection and camouflaged object detection tasks. | 168 |
| eddieantonio/ocreval | A collection of tools and utilities for evaluating the performance and quality of OCR output. | 57 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,650 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts used in Large Language Models (LLMs) based on perplexity estimation. | 43 |