langtest

Model Tester

A tool for testing and evaluating large language models with a focus on AI safety and model assessment.

Deliver safe & effective language models

GitHub

506 stars
10 watching
41 forks
Language: Python
last commit: about 1 month ago
Topics: ai-safety, ai-testing, artificial-intelligence, benchmark-framework, benchmarks, ethics-in-ai, large-language-models, llm, llm-as-evaluator, llm-evaluation-toolkit, llm-test, llm-testing, ml-safety, ml-testing, mlops, model-assessment, nlp, responsible-ai, trustworthy-ai
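
A minimal usage sketch of the kind of test run langtest supports, based on its documented Harness interface; the task, model name, and hub below are illustrative placeholders, and exact parameters may differ between releases:

```python
# pip install langtest
from langtest import Harness

# Set up a test harness for a named-entity-recognition model hosted on
# Hugging Face (model name and hub are illustrative placeholders).
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

harness.generate()        # generate test cases (e.g. robustness perturbations)
harness.run()             # run the model against the generated cases
print(harness.report())   # summarize pass/fail rates per test category
```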

Related projects:

Repository | Description | Stars
howiehwong/trustllm | A toolkit for assessing trustworthiness in large language models | 491
aiplanethub/beyondllm | An open-source toolkit for building and evaluating large language models | 267
declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction tuning methods | 535
neulab/explainaboard | An interactive tool to analyze and compare the performance of natural language processing models | 362
vhellendoorn/code-lms | A guide to using pre-trained large language models in source code analysis and generation | 1,789
comet-ml/opik | A platform for evaluating and testing large language models (LLMs) during development and production | 2,588
freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 56
innogames/ltc | A tool for managing load tests and analyzing performance results | 200
qcri/llmebench | A benchmarking framework for large language models | 81
maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,350
openlmlab/gaokao-bench | An evaluation framework using Chinese high school examination questions to assess large language model capabilities | 565
flagai-open/aquila2 | Provides pre-trained language models and tools for fine-tuning and evaluation | 439
bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 920
01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,743
ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 322