BIG-bench

Language model benchmark

A collaborative benchmark designed to evaluate the capabilities of large language models across a diverse collection of tasks and to measure their performance

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
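For context, BIG-bench tasks are commonly defined as JSON files of input/target examples that a model is scored against. The sketch below is a minimal, hypothetical illustration of that pattern (the inline task file and `model_fn` are stand-ins, not the repository's official schema or API), showing how such a task could be scored with exact-match accuracy:

```python
import json

# A minimal BIG-bench-style JSON task: a list of input/target examples.
# (Illustrative only; see the repository for the full task schema.)
task_json = """
{
  "name": "example_arithmetic",
  "description": "Answer simple arithmetic questions.",
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "What is 2 + 3?", "target": "5"},
    {"input": "What is 7 - 4?", "target": "3"}
  ]
}
"""

def model_fn(prompt: str) -> str:
    """Stand-in for a real language model; replace with an actual model call."""
    answers = {"What is 2 + 3?": "5", "What is 7 - 4?": "3"}
    return answers.get(prompt, "")

def exact_match_accuracy(task: dict, generate) -> float:
    """Score a task by comparing generated text to each example's target."""
    examples = task["examples"]
    hits = sum(generate(ex["input"]).strip() == ex["target"] for ex in examples)
    return hits / len(examples)

task = json.loads(task_json)
print(f"{task['name']}: accuracy = {exact_match_accuracy(task, model_fn):.2f}")
```

Keeping tasks as plain data files in this style is what lets a benchmark like BIG-bench scale to many contributors: new tasks can be added without changing the evaluation harness.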

GitHub

3k stars
51 watching
591 forks
Language: Python
last commit: 4 months ago
Linked from 2 awesome lists


Related projects:

| Repository | Description | Stars |
|---|---|---|
| bigscience-workshop/promptsource | A toolkit for creating and sharing natural language prompts that help large language models generalize to new tasks. | 2,696 |
| brexhq/prompt-engineering | Guides software developers on how to effectively use and build systems around large language models like GPT-4. | 8,440 |
| kostya/benchmarks | A collection of benchmark programs comparing performance across programming languages. | 2,814 |
| fminference/flexllmgen | A high-throughput generation engine for running large language models on a single GPU. | 9,192 |
| microsoft/promptbench | A unified framework for evaluating the performance and robustness of large language models across scenarios. | 2,462 |
| openbmb/bmtools | An open platform for building and extending large language models with tool-use capabilities. | 2,898 |
| huggingface/text-generation-inference | A toolkit for deploying and serving large language models. | 9,106 |
| google/benchmark | A C++ microbenchmarking library for measuring the execution time of specific code snippets. | 9,035 |
| optimalscale/lmflow | A toolkit for fine-tuning large language models with efficient inference capabilities. | 8,273 |
| openbmb/toolbench | A platform for training, serving, and evaluating large language models for tool-use capability. | 4,843 |
| brightmart/text_classification | An NLP project offering various deep learning models and techniques for text classification. | 7,861 |
| felixgithub2017/mmcu | Evaluates the semantic understanding of large Chinese language models using a multitask dataset. | 87 |
| deepseek-ai/deepseek-v2 | A high-performance mixture-of-experts language model with strong results and efficient inference. | 3,590 |
| confident-ai/deepeval | A framework for evaluating large language models. | 3,669 |
| tianyi-lab/hallusionbench | An image-context reasoning benchmark designed to challenge large vision-language models and help improve their accuracy. | 243 |