langtest

Model Tester

A tool for testing and evaluating large language models with a focus on AI safety and model assessment.

Deliver safe & effective language models

GitHub

506 stars
10 watching
41 forks
Language: Python
last commit: about 1 month ago
Topics: ai-safety, ai-testing, artificial-intelligence, benchmark-framework, benchmarks, ethics-in-ai, large-language-models, llm, llm-as-evaluator, llm-evaluation-toolkit, llm-test, llm-testing, ml-safety, ml-testing, mlops, model-assessment, nlp, responsible-ai, trustworthy-ai
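
A minimal usage sketch of the kind of test run langtest supports, based on its documented Harness interface; the task, model name, and hub below are illustrative placeholders, and exact parameters may differ between releases:

```python
# pip install langtest
from langtest import Harness

# Set up a test harness for a named-entity-recognition model hosted on
# Hugging Face (model name and hub are illustrative placeholders).
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

harness.generate()        # generate test cases (e.g. robustness perturbations)
harness.run()             # run the model against the generated cases
print(harness.report())   # summarize pass/fail rates per test category
```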

Related projects:

Repository | Description | Stars
howiehwong/trustllm | A toolkit for assessing trustworthiness in large language models | 491
aiplanethub/beyondllm | An open-source toolkit for building and evaluating large language models | 267
declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535
neulab/explainaboard | An interactive tool to analyze and compare the performance of natural language processing models | 362
vhellendoorn/code-lms | A guide to using pre-trained large language models in source code analysis and generation | 1,789
comet-ml/opik | A platform for evaluating and testing large language models (LLMs) during development and production | 2,588
freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 56
innogames/ltc | A tool for managing load tests and analyzing performance results | 200
qcri/llmebench | A benchmarking framework for large language models | 81
maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,350
openlmlab/gaokao-bench | An evaluation framework using Chinese high school examination questions to assess large language model capabilities | 565
flagai-open/aquila2 | Provides pre-trained language models and tools for fine-tuning and evaluation | 439
bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 920
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 322