langtest: Model Tester
A tool for testing and evaluating large language models, with a focus on AI safety and model assessment.
Deliver safe & effective language models
501 stars
10 watching
40 forks
Language: Python
Last commit: 9 days ago
Topics: ai-safety, ai-testing, artificial-intelligence, benchmark-framework, benchmarks, ethics-in-ai, large-language-models, llm, llm-as-evaluator, llm-evaluation-toolkit, llm-test, llm-testing, ml-safety, ml-testing, mlops, model-assessment, nlp, responsible-ai, trustworthy-ai
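For context, here is a minimal usage sketch based on langtest's documented Harness workflow; the task, model name, and hub are illustrative, so check the project's documentation for the parameters your version supports.

```python
# A minimal sketch of langtest's Harness workflow (model name is illustrative).
from langtest import Harness

# Create a test harness for a named-entity-recognition model pulled from
# the Hugging Face hub; task, model, and hub are the core Harness parameters.
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

# Generate test cases, run them against the model, and summarize pass/fail
# rates per test category (robustness, bias, and so on).
harness.generate().run().report()
```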
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| howiehwong/trustllm | A toolkit for assessing trustworthiness in large language models | 466 |
| aiplanethub/beyondllm | An open-source toolkit for building and evaluating large language models | 261 |
| declare-lab/instruct-eval | An evaluation framework for instruction-tuned large language models | 528 |
| neulab/explainaboard | An interactive tool to analyze and compare the performance of natural language processing models | 361 |
| vhellendoorn/code-lms | A guide to using pre-trained large language models for source code analysis and generation | 1,782 |
| comet-ml/opik | An end-to-end platform for evaluating and testing large language models | 2,121 |
| freedomintelligence/mllm-bench | A benchmark for evaluating and comparing the performance of multimodal large language models on various tasks | 55 |
| innogames/ltc | A tool for managing load tests and analyzing performance results | 198 |
| qcri/llmebench | A benchmarking framework for large language models | 80 |
| maluuba/nlg-eval | A toolset for evaluating and comparing natural language generation models | 1,347 |
| openlmlab/gaokao-bench | An evaluation framework that uses Chinese college entrance exam (Gaokao) questions to assess large language model capabilities | 551 |
| flagai-open/aquila2 | Pre-trained language models and tools for fine-tuning and evaluation | 437 |
| bilibili/index-1.9b | A lightweight, multilingual language model with a long context length | 904 |
| 01-ai/yi | A series of large language models trained from scratch to excel in multiple NLP tasks | 7,699 |
| ailab-cvc/seed-bench | A benchmark for evaluating large language models' ability to process multimodal input | 315 |