betterprompt
Prompt evaluator
An API for evaluating the quality of text prompts used with Large Language Models (LLMs), based on perplexity estimation.
Test suite for LLM prompts.
38 stars
3 watching
4 forks
Language: Python
Last commit: 6 months ago
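The core idea is to score a prompt by the perplexity a causal language model assigns to it, on the assumption that lower perplexity indicates text the model handles more naturally. The snippet below is a minimal illustrative sketch of that idea using Hugging Face transformers and GPT-2; it is not betterprompt's actual API, and the `prompt_perplexity` helper and model choice are assumptions made for the example.

```python
# Illustrative sketch of perplexity-based prompt scoring (not betterprompt's API).
# Lower perplexity suggests the prompt reads as more "natural" to the model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # any causal LM works; gpt2 is small enough to run locally
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Return the model's perplexity for a prompt (lower is typically better)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy loss
        outputs = model(**inputs, labels=inputs["input_ids"])
    return float(torch.exp(outputs.loss))

if __name__ == "__main__":
    candidates = [
        "Summarize the following article in three bullet points:",
        "article summarize bullet three points following the in:",
    ]
    for prompt in candidates:
        print(f"{prompt_perplexity(prompt):8.2f}  {prompt}")
```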

Related projects:

Repository | Description | Stars |
---|---|---|
vaibkumr/prompt-optimizer | A tool to reduce the complexity of text prompts to minimize API costs and model computations. | 241 |
mshukor/evalign-icl | Evaluates and improves large multimodal models through in-context learning. | 20 |
openai/simple-evals | A library for evaluating language models using standardized prompts and benchmarking tests. | 1,939 |
pkunlp-icler/pca-eval | An open-source benchmark and evaluation tool for assessing multimodal large language models' performance in embodied decision-making tasks. | 100 |
ruixiangcui/agieval | Evaluates foundation models on human-centric tasks with diverse exams and question types. | 708 |
rlancemartin/auto-evaluator | An evaluation tool for question-answering systems using large language models and natural language processing techniques. | 1,063 |
open-compass/vlmevalkit | A toolkit for evaluating large vision-language models on various benchmarks and datasets. | 1,343 |
emrekavur/chaos-evaluation | Evaluates segmentation performance in medical imaging using multiple metrics. | 57 |
princeton-nlp/charxiv | An evaluation suite for assessing chart understanding in multimodal large language models. | 75 |
milvlg/prophet | A two-stage framework that prompts large language models with answer heuristics for knowledge-based visual question answering tasks. | 267 |
freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks. | 55 |
chenllliang/mmevalpro | A benchmarking framework for evaluating Large Multimodal Models with rigorous metrics and an efficient evaluation pipeline. | 22 |
obss/jury | A comprehensive toolkit for evaluating NLP experiments, offering automated metrics and efficient computation. | 188 |
dfki-nlp/gevalm | Evaluates German transformer language models with syntactic agreement tests. | 7 |
declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods. | 528 |