jury

NLP evaluator

A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation.

Comprehensive NLP Evaluation System

GitHub

188 stars
5 watching
20 forks
Language: Python
Last commit: 4 months ago
Linked from 1 awesome list

Topics: datasets, evaluate, evaluation, huggingface, machine-learning, metrics, natural-language-processing, nlp, nlp-evaluation, python, pytorch, transformers
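
A minimal usage sketch is shown below, following the scorer pattern from the jury README: construct a Jury scorer with a list of metrics and call it on predictions and references. Exact metric names and call signatures may vary between versions, so treat this as an illustrative assumption rather than a definitive API reference.

```python
# Minimal sketch of scoring generations with jury (assumes `pip install jury`).
# The Jury(metrics=...) constructor and callable-scorer usage follow the
# project's documented pattern; details may differ across releases.
from jury import Jury

# Each prediction/reference entry is a list of candidate strings.
predictions = [["the cat is on the mat"], ["a dog runs in the park"]]
references = [["the cat sits on the mat"], ["a dog is running in the park"]]

# Compute several automated metrics in a single pass.
scorer = Jury(metrics=["bleu", "meteor", "rouge"])
scores = scorer(predictions=predictions, references=references)
print(scores)  # e.g. per-metric score dictionaries plus counts
```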

Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models. | 1,347 |
| huggingface/evaluate | An evaluation framework for machine learning models and datasets, providing standardized metrics and tools for comparing model performance. | 2,034 |
| allenai/olmo-eval | An evaluation framework for large language models. | 310 |
| nullne/evaluator | An expression evaluator library written in Go. | 41 |
| openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
| olical/conjure | An interactive environment for evaluating code within a running program. | 1,785 |
| open-compass/lawbench | Evaluates the legal knowledge of large language models using a custom benchmarking framework. | 267 |
| tatsu-lab/alpaca_eval | An automatic evaluation tool for large language models. | 1,526 |
| princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 75 |
| ermlab/polish-word-embeddings-review | An evaluation framework for Polish word embeddings prepared by various research groups using analogy tasks. | 4 |
| huggingface/lighteval | A toolkit for evaluating Large Language Models across multiple backends. | 804 |
| lartpang/pysodevaltoolkit | A comprehensive Python toolbox for evaluating salient object detection and camouflaged object detection tasks. | 167 |
| eddieantonio/ocreval | A collection of tools and utilities for evaluating the performance and quality of OCR output. | 57 |
| hkust-nlp/ceval | An evaluation suite providing multiple-choice questions for foundation models in various disciplines, with tools for assessing model performance. | 1,636 |
| krrishdholakia/betterprompt | An API for evaluating the quality of text prompts used in Large Language Models (LLMs) based on perplexity estimation. | 38 |