BIG-bench

Language model benchmark

A collaborative benchmark designed to evaluate the capabilities of large language models across a diverse collection of tasks and to measure their performance

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
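For context, BIG-bench tasks are commonly defined as JSON files of input/target examples that a model is scored against. The sketch below is a minimal, hypothetical illustration of that pattern (the inline task file and `model_fn` are stand-ins, not the repository's official schema or API), showing how such a task could be scored with exact-match accuracy:

```python
import json

# A minimal BIG-bench-style JSON task: a list of input/target examples.
# (Illustrative only; see the repository for the full task schema.)
task_json = """
{
  "name": "example_arithmetic",
  "description": "Answer simple arithmetic questions.",
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "What is 2 + 3?", "target": "5"},
    {"input": "What is 7 - 4?", "target": "3"}
  ]
}
"""

def model_fn(prompt: str) -> str:
    """Stand-in for a real language model; replace with an actual model call."""
    answers = {"What is 2 + 3?": "5", "What is 7 - 4?": "3"}
    return answers.get(prompt, "")

def exact_match_accuracy(task: dict, generate) -> float:
    """Score a task by comparing generated text to each example's target."""
    examples = task["examples"]
    hits = sum(generate(ex["input"]).strip() == ex["target"] for ex in examples)
    return hits / len(examples)

task = json.loads(task_json)
print(f"{task['name']}: accuracy = {exact_match_accuracy(task, model_fn):.2f}")
```

Keeping tasks as plain data files in this style is what lets a benchmark like BIG-bench scale to many contributors: new tasks can be added without changing the evaluation harness.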

GitHub

3k stars
51 watching
591 forks
Language: Python
last commit: 4 months ago
Linked from 2 awesome lists


Related projects:

| Repository | Description | Stars |
|---|---|---|
| bigscience-workshop/promptsource | A toolkit for creating and sharing natural language prompts that help large language models generalize to new tasks. | 2,696 |
| brexhq/prompt-engineering | Guides software developers on how to effectively use and build systems around large language models like GPT-4. | 8,440 |
| kostya/benchmarks | A collection of benchmark programs comparing performance across programming languages. | 2,814 |
| fminference/flexllmgen | A high-throughput generation engine for running large language models on a single GPU. | 9,192 |
| microsoft/promptbench | A unified framework for evaluating the performance and robustness of large language models across scenarios. | 2,462 |
| openbmb/bmtools | An open platform for building and extending large language models with tool-use capabilities. | 2,898 |
| huggingface/text-generation-inference | A toolkit for deploying and serving large language models. | 9,106 |
| google/benchmark | A C++ microbenchmarking library for measuring the execution time of specific code snippets. | 9,035 |
| optimalscale/lmflow | A toolkit for fine-tuning large language models with efficient inference capabilities. | 8,273 |
| openbmb/toolbench | A platform for training, serving, and evaluating large language models for tool-use capability. | 4,843 |
| brightmart/text_classification | An NLP project offering various deep learning models and techniques for text classification. | 7,861 |
| felixgithub2017/mmcu | Evaluates the semantic understanding of large Chinese language models using a multitask dataset. | 87 |
| deepseek-ai/deepseek-v2 | A high-performance mixture-of-experts language model with strong results and efficient inference. | 3,590 |
| confident-ai/deepeval | A framework for evaluating large language models. | 3,669 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 243 |