flash-attention

Attention algorithms

Implementations of efficient exact attention mechanisms for machine learning

Fast and memory-efficient exact attention

GitHub

15k stars
122 watching
1k forks
Language: Python
Last commit: about 1 month ago
Linked from 1 awesome list
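For context, here is a minimal usage sketch, not taken from the repository's own documentation: it calls the packaged flash_attn_func interface and checks it against a naive softmax-attention reference. It assumes a CUDA GPU, fp16 inputs, and the pip-installed flash-attn package; keyword arguments may vary between releases.

import torch
from flash_attn import flash_attn_func  # assumes the flash-attn pip package is installed

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
# FlashAttention kernels expect fp16/bf16 tensors laid out as (batch, seqlen, nheads, headdim) on the GPU.
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)  # exact causal attention without materializing the score matrix

# Reference: the same exact attention computed the naive way, which builds the full (seqlen x seqlen) score matrix.
qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, nheads, seqlen, headdim)
scores = qt @ kt.transpose(-2, -1) / headdim ** 0.5
causal_mask = torch.triu(torch.ones(seqlen, seqlen, dtype=torch.bool, device="cuda"), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
ref = (scores.softmax(dim=-1) @ vt).transpose(1, 2)

print(torch.allclose(out, ref, atol=1e-2))  # results should match up to fp16 tolerance

The comparison illustrates the repository's claim: FlashAttention computes the same exact result as standard softmax attention while avoiding the quadratic score matrix in GPU memory.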


Related projects:

Repository | Description | Stars
facebookincubator/aitemplate | A framework that transforms deep neural networks into high-performance GPU-optimized C++ code for efficient inference serving | 4,573
luolc/adabound | An optimizer that combines the benefits of Adam and SGD algorithms | 2,908
facebookresearch/slowfast | A state-of-the-art video understanding codebase with efficient training methods and pre-trained models for various tasks | 6,680
albumentations-team/albumentations | A Python library providing a flexible and fast image augmentation tool for machine learning and computer vision tasks | 14,386
arrayfire/arrayfire | A high-level abstraction of data on parallel architectures for efficient tensor computing and machine learning applications | 4,587
microsoft/flaml | Automates machine learning workflows and optimizes model performance using large language models and efficient algorithms | 3,968
rapidsai/cudf | A GPU-accelerated data manipulation library built on top of C++/CUDA and Apache Arrow | 8,534
dynamorio/drmemory | An open-source memory debugger for multiple operating systems and platforms | 2,468
bytedance/byteps | A high-performance distributed deep learning framework supporting multiple frameworks and networks | 3,635
rapidsai/cuml | A suite of libraries implementing machine learning algorithms and mathematical primitives on NVIDIA GPUs | 4,292
huggingface/accelerate | A tool to simplify training and deployment of PyTorch models on various devices and configurations | 8,056
microsoft/deepspeed | A deep learning optimization library that simplifies distributed training and inference on modern computing hardware | 35,863
flashlight/flashlight | A C++ machine learning library with autograd support and high-performance defaults for efficient computation | 5,300
ntop/pf_ring | A framework for high-speed packet processing on Linux kernels | 2,718
tencent/pocketflow | A framework that automatically compresses and accelerates deep learning models to make them suitable for mobile devices with limited computational resources | 2,787