langtest
Model Tester
A tool for testing and evaluating large language models with a focus on AI safety and model assessment.
Deliver safe & effective language models
506 stars
10 watching
41 forks
Language: Python
last commit: about 1 month ago
Topics: ai-safety, ai-testing, artificial-intelligence, benchmark-framework, benchmarks, ethics-in-ai, large-language-models, llm, llm-as-evaluator, llm-evaluation-toolkit, llm-test, llm-testing, ml-safety, ml-testing, mlops, model-assessment, nlp, responsible-ai, trustworthy-ai
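A minimal sketch of a typical langtest run, following the Harness workflow from the project's quickstart; the model name and hub below are illustrative assumptions, not part of this listing:

```python
# Sketch of a basic langtest evaluation (assumes `pip install langtest`).
from langtest import Harness

# Build a test harness for a named-entity-recognition model.
# The model/hub values here are examples; substitute your own.
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

harness.generate()  # generate test cases (robustness, bias, etc.)
harness.run()       # run the model against the generated cases
harness.report()    # summarize pass/fail rates per test category
```

The three calls also chain (`harness.generate().run().report()`), which is how the project's examples usually write it.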
Related projects:
| Repository | Description | Stars |
|---|---|---|
| howiehwong/trustllm | A toolkit for assessing trustworthiness in large language models | 491 |
| aiplanethub/beyondllm | An open-source toolkit for building and evaluating large language models | 267 |
| declare-lab/instruct-eval | An evaluation framework for large language models trained with instruction-tuning methods | 535 |
| neulab/explainaboard | An interactive tool to analyze and compare the performance of natural language processing models | 362 |
| vhellendoorn/code-lms | A guide to using pre-trained large language models for source code analysis and generation | 1,789 |
| comet-ml/opik | A platform for evaluating and testing large language models during development and production | 2,588 |
| freedomintelligence/mllm-bench | Evaluates and compares the performance of multimodal large language models on various tasks | 56 |
| innogames/ltc | A tool for managing load tests and analyzing performance results | 200 |
| qcri/llmebench | A benchmarking framework for large language models | 81 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,350 |
| openlmlab/gaokao-bench | An evaluation framework that uses Chinese college entrance examination (Gaokao) questions to assess large language model capabilities | 565 |
| flagai-open/aquila2 | Pre-trained language models with tools for fine-tuning and evaluation | 439 |
| bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 920 |
| 01-ai/yi | A series of large language models trained from scratch for a range of NLP tasks | 7,743 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 322 |