dlrover
Distributed Training System
Automates large-scale deep learning training on distributed clusters, providing fault tolerance and fast recovery from failures.
DLRover: An Automatic Distributed Deep Learning System
1k stars
49 watching
168 forks
Language: Python
Last commit: about 1 month ago
Linked from 1 awesome list
distributed-training hacktoberfest k8s llm-training
Related projects:
Repository | Description | Stars |
---|---|---|
wyy-123-xyy/ra-fed | A Python implementation of a distributed machine learning framework for training neural networks on multiple GPUs | 6 |
learning-at-home/hivemind | A PyTorch library for decentralized deep learning across the Internet. | 2,078 |
open-mmlab/mmengine | Provides a flexible and configurable framework for training deep learning models with PyTorch. | 1,196 |
nitrain/nitrain | A framework-agnostic Python library for training AI models on medical images | 1,865 |
madrylab/robustness | A library for training and evaluating neural networks with a focus on adversarial robustness. | 921 |
aporia-ai/mlnotify | Automated notification system for machine learning model training | 343 |
tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model to perform various information retrieval tasks. | 114 |
loudinthecloud/dpwa | A distributed learning framework that enables peer-to-peer parameter averaging and asynchronous training of deep neural networks | 53 |
geek-ai/magent | A platform for multi-agent reinforcement learning research and development | 1,700 |
ardanlabs/training-ai | Provides training materials and tools for building machine learning applications | 72 |
doudar/smartspin2k | Turns spin bikes into smart trainers with automatic resistance control and online connectivity to cycling apps | 189 |
tdeboissiere/deeplearningimplementations | A collection of implementations of recent deep learning papers in Python | 1,814 |
google-deepmind/meltingpot | Assesses generalization of multi-agent reinforcement learning algorithms to novel social situations | 637 |
ahmedfgad/neuralgenetic | Trains artificial neural networks using the genetic algorithm | 241 |
deepseek-ai/deepseek-moe | A large language model with improved efficiency and performance compared to similar models | 1,024 |