flash-attention
Attention library
An open-source implementation of efficient attention mechanisms for neural networks
Fast and memory-efficient exact attention
14k stars
119 watching
1k forks
Language: Python
Last commit: 5 days ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
facebookincubator/aitemplate | A framework that transforms deep neural networks into high-performance GPU-optimized C++ code for efficient inference serving. | 4,561 |
luolc/adabound | An optimizer that combines the benefits of the Adam and SGD algorithms | 2,907 |
facebookresearch/slowfast | A state-of-the-art video understanding codebase with efficient training methods and pre-trained models for various tasks | 6,623 |
albumentations-team/albumentations | A Python library for applying image transformations to data used in deep learning and computer vision tasks | 14,254 |
arrayfire/arrayfire | A high-level abstraction of data on parallel architectures for efficient tensor computing and machine learning applications. | 4,564 |
microsoft/flaml | A library that automates machine learning workflows and optimizes model performance using large language models and efficient algorithms | 3,919 |
rapidsai/cudf | A GPU-accelerated data manipulation library built on top of Arrow and libcudf. | 8,448 |
dynamorio/drmemory | An open-source memory debugger for multiple operating systems and platforms | 2,443 |
bytedance/byteps | A high-performance distributed training framework that supports multiple deep learning frameworks and network types | 3,630 |
rapidsai/cuml | A suite of libraries implementing machine learning algorithms and mathematical primitives on NVIDIA GPUs | 4,238 |
huggingface/accelerate | A tool to simplify training and deployment of PyTorch models on various devices and configurations | 7,947 |
microsoft/deepspeed | A deep learning optimization library that makes distributed training and inference easy, efficient, and effective. | 35,463 |
flashlight/flashlight | A C++ machine learning library with autograd support and high-performance defaults for efficient computation. | 5,285 |
ntop/pf_ring | A framework for high-speed packet processing on Linux kernels. | 2,698 |
tencent/pocketflow | A framework that automatically compresses and accelerates deep learning models to make them suitable for mobile devices with limited computational resources. | 2,788 |