BIG-bench
Language model benchmark
A benchmark designed to probe large language models and extrapolate their future capabilities through a diverse set of tasks.
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
3k stars
51 watching
593 forks
Language: Python
Last commit: 6 months ago
Linked from 2 awesome lists
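BIG-bench tasks are commonly defined as JSON files containing a list of examples with input prompts and target answers. The sketch below shows how such a task might be loaded and scored by exact-match accuracy; the field names (`examples`, `input`, `target`) follow the repository's JSON task schema as I understand it, and the inline task plus the toy model are purely illustrative assumptions.

```python
import json

# A minimal inline task mimicking the BIG-bench JSON task format.
# Field names are assumptions based on the repository's task schema.
TASK_JSON = """
{
  "name": "toy_arithmetic",
  "description": "Answer simple addition questions.",
  "examples": [
    {"input": "What is 2 + 3?", "target": "5"},
    {"input": "What is 7 + 1?", "target": "8"}
  ]
}
"""

def exact_match_score(task: dict, model) -> float:
    """Score a model (a callable mapping prompt -> answer) by exact-match accuracy."""
    examples = task["examples"]
    correct = sum(
        1 for ex in examples if model(ex["input"]).strip() == ex["target"]
    )
    return correct / len(examples)

task = json.loads(TASK_JSON)

# A trivial stand-in "model" that parses and evaluates the addition itself,
# standing in for a real language model call.
def toy_model(prompt: str) -> str:
    a, b = prompt.rstrip("?").split("is")[1].split("+")
    return str(int(a) + int(b))

print(exact_match_score(task, toy_model))  # → 1.0
```

In practice a real harness would iterate over the task files in the repository and substitute an actual model query for `toy_model`; exact match is only one of several scoring modes such benchmarks typically support.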
Related projects:
| Repository | Description | Stars |
|---|---|---|
| bigscience-workshop/promptsource | A toolkit for creating and using natural language prompts that enable large language models to generalize to new tasks | 2,718 |
| brexhq/prompt-engineering | Guides software developers on how to effectively use and build systems around large language models like GPT-4 | 8,487 |
| kostya/benchmarks | A collection of benchmarking tests for various programming languages | 2,825 |
| fminference/flexllmgen | Generates large language model outputs in high-throughput mode on single GPUs | 9,236 |
| microsoft/promptbench | A unified framework for evaluating large language models' performance and robustness in various scenarios | 2,487 |
| openbmb/bmtools | Tools and a platform for building and extending large language models | 2,907 |
| huggingface/text-generation-inference | A toolkit for deploying and serving large language models (LLMs) for high-performance text generation | 9,456 |
| google/benchmark | A microbenchmarking library for measuring the execution time of specific code snippets | 9,113 |
| optimalscale/lmflow | A toolkit for fine-tuning and inference of large machine learning models | 8,312 |
| openbmb/toolbench | A platform for training, serving, and evaluating large language models with tool-use capability | 4,888 |
| brightmart/text_classification | An NLP project offering various text classification models and techniques for deep learning exploration | 7,881 |
| felixgithub2017/mmcu | Measures large language models' understanding on a massive multitask Chinese dataset | 87 |
| deepseek-ai/deepseek-v2 | A high-performance mixture-of-experts language model with strong results and efficient inference | 3,758 |
| confident-ai/deepeval | A framework for evaluating large language models | 4,003 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy | 259 |