nccl
GPU communication library
A library of optimized primitives for collective multi-GPU communication and efficient inter-GPU data transfer.
3k stars
153 watching
825 forks
Language: C++
Last commit: 2 months ago
Linked from 2 awesome lists
Related projects:
Repository | Description | Stars |
---|---|---|
sergio0694/computesharp | Enables C# code to run on the GPU through DirectX and dynamically generated shaders | 2,775 |
keylase/nvidia-patch | Removes NVIDIA's restriction on simultaneous NVENC video encoding sessions | 3,532 |
uncomplicate/clojurecl | A Clojure library that enables parallel computations on GPU using OpenCL | 277 |
nvlabs/instant-ngp | A software toolkit for training and rendering neural graphics primitives | 16,033 |
nvidia-ai-iot/cupcl | A set of libraries and sample code for 3D point cloud processing using CUDA | 576 |
vczh-libraries/gacui | A comprehensive C++ library for building GPU-accelerated user interfaces with WYSIWYG editing tools and XML support | 2,348 |
nvidia/apex | Tools for streamlined mixed precision and distributed training in PyTorch | 8,407 |
nvidia/matx | A C++17 GPU-accelerated numerical computing library with Python-like syntax | 1,220 |
sony/nnabla | A deep learning framework that provides a flexible and expressive Python API for building and training neural networks on various platforms | 2,728 |
nvidia/multi-gpu-programming-models | A collection of examples demonstrating various approaches to programming multiple GPUs in parallel | 557 |
rapidsai/cuml | A suite of libraries implementing machine learning algorithms and mathematical primitives on NVIDIA GPUs | 4,238 |
nrwl/nx | A build system designed to optimize monorepos and integrate well with various frameworks and tools for fast CI. | 23,681 |
zeux/meshoptimizer | A C++ library that optimizes 3D meshes for faster rendering on GPUs | 5,703 |
nvlabs/tiny-cuda-nn | A C++/CUDA framework for training and querying neural networks using GPUs | 3,763 |
baidu-research/warp-ctc | A fast parallel implementation of the connectionist temporal classification (CTC) loss function, used in sequence learning, for CPUs and GPUs | 4,069 |