BIG-bench
Language model benchmark
A benchmark that evaluates the capabilities of large language models by running them on a diverse collection of tasks and measuring their performance.
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
3k stars
51 watching
591 forks
Language: Python
last commit: 4 months ago
Linked from 2 awesome lists
Related projects:
Repository | Description | Stars |
---|---|---|
bigscience-workshop/promptsource | A toolkit for creating and using natural language prompts to enable large language models to generalize to new tasks. | 2,696 |
brexhq/prompt-engineering | Guides software developers on how to effectively use and build systems around Large Language Models like GPT-4. | 8,440 |
kostya/benchmarks | A collection of benchmarking tests for various programming languages | 2,814 |
fminference/flexllmgen | Generates large language model outputs in high-throughput mode on single GPUs | 9,192 |
microsoft/promptbench | A unified framework for evaluating large language models' performance and robustness in various scenarios. | 2,462 |
openbmb/bmtools | Tools and platform for building and extending large language models | 2,898 |
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models. | 9,106 |
google/benchmark | A microbenchmarking library that allows users to measure the execution time of specific code snippets | 9,035 |
optimalscale/lmflow | A toolkit for finetuning large language models and providing efficient inference capabilities | 8,273 |
openbmb/toolbench | A platform for training, serving, and evaluating large language models to enable tool use capability | 4,843 |
brightmart/text_classification | An NLP project offering various text classification models and techniques for deep learning exploration | 7,861 |
felixgithub2017/mmcu | Evaluates the semantic understanding capabilities of large Chinese language models using a multimodal dataset. | 87 |
deepseek-ai/deepseek-v2 | A high-performance mixture-of-experts language model with strong performance and efficient inference capabilities. | 3,590 |
confident-ai/deepeval | A framework for evaluating large language models | 3,669 |
tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 243 |