flash-attention
Attention algorithms
Implementations of efficient exact attention mechanisms for machine learning
Fast and memory-efficient exact attention
15k stars
122 watching
1k forks
Language: Python
Last commit: about 1 month ago
Linked from 1 awesome list
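
A minimal usage sketch of the library's `flash_attn_func` entry point (a sketch only, assuming a CUDA GPU, half-precision inputs, and the `flash-attn` pip package; the argument names and the `(batch, seqlen, nheads, headdim)` shape convention follow the repository README and may differ between versions):

```python
# Minimal sketch: exact attention via flash-attn (assumes a CUDA GPU and
# that the flash-attn package is installed, e.g. `pip install flash-attn`).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")

# Exact (non-approximate) attention computed without materializing the full
# seqlen x seqlen score matrix in GPU memory; causal=True applies a causal mask.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```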
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| facebookincubator/aitemplate | A framework that transforms deep neural networks into high-performance GPU-optimized C++ code for efficient inference serving. | 4,573 |
| luolc/adabound | An optimizer that combines the benefits of the Adam and SGD algorithms. | 2,908 |
| facebookresearch/slowfast | A state-of-the-art video understanding codebase with efficient training methods and pre-trained models for various tasks. | 6,680 |
| albumentations-team/albumentations | A Python library providing a flexible and fast image augmentation tool for machine learning and computer vision tasks. | 14,386 |
| arrayfire/arrayfire | A high-level abstraction of data on parallel architectures for efficient tensor computing and machine learning applications. | 4,587 |
| microsoft/flaml | Automates machine learning workflows and optimizes model performance using large language models and efficient algorithms. | 3,968 |
| rapidsai/cudf | A GPU-accelerated data manipulation library built on top of C++/CUDA and Apache Arrow. | 8,534 |
| dynamorio/drmemory | An open-source memory debugger for multiple operating systems and platforms. | 2,468 |
| bytedance/byteps | A high-performance distributed deep learning framework supporting multiple frameworks and networks. | 3,635 |
| rapidsai/cuml | A suite of libraries implementing machine learning algorithms and mathematical primitives on NVIDIA GPUs. | 4,292 |
| huggingface/accelerate | A tool to simplify training and deployment of PyTorch models on various devices and configurations. | 8,056 |
| microsoft/deepspeed | A deep learning optimization library that simplifies distributed training and inference on modern computing hardware. | 35,863 |
| flashlight/flashlight | A C++ machine learning library with autograd support and high-performance defaults for efficient computation. | 5,300 |
| ntop/pf_ring | A framework for high-speed packet processing on Linux kernels. | 2,718 |
| tencent/pocketflow | A framework that automatically compresses and accelerates deep learning models to make them suitable for mobile devices with limited computational resources. | 2,787 |