dlrover

Distributed Training System

Automates large-scale deep learning training on distributed clusters, providing fault tolerance and fast recovery from failures.

DLRover: An Automatic Distributed Deep Learning System

GitHub

1k stars

49 watching

168 forks

Language: Python

Last commit: 8 months ago

Linked from 1 awesome list

distributed-training, hacktoberfest, k8s, llm-training

Backlinks from these awesome lists:

ethicalml/awesome-production-machine-learning

Related projects:

Repository	Description	Stars
wyy-123-xyy/ra-fed	A Python implementation of a distributed machine learning framework for training neural networks on multiple GPUs	6
learning-at-home/hivemind	A PyTorch library for decentralized deep learning across the Internet.	2,078
open-mmlab/mmengine	Provides a flexible and configurable framework for training deep learning models with PyTorch.	1,196
nitrain/nitrain	A framework-agnostic Python library for training AI models on medical images	1,865
madrylab/robustness	A library for training and evaluating neural networks with a focus on adversarial robustness.	921
aporia-ai/mlnotify	Automated notification system for machine learning model training	343
tiger-ai-lab/uniir	Trains and evaluates a universal multimodal retrieval model to perform various information retrieval tasks.	114
loudinthecloud/dpwa	A distributed learning framework that enables peer-to-peer parameter averaging and asynchronous training of deep neural networks	53
geek-ai/magent	A platform for multi-agent reinforcement learning research and development	1,700
ardanlabs/training-ai	Provides training materials and tools for building machine learning applications	72
doudar/smartspin2k	Turns spin bikes into smart trainers with automatic resistance control and online connectivity to cycling apps	189
tdeboissiere/deeplearningimplementations	A collection of implementations of recent deep learning papers in Python	1,814
google-deepmind/meltingpot	Assesses generalization of multi-agent reinforcement learning algorithms to novel social situations	637
ahmedfgad/neuralgenetic	Trains artificial neural networks using the genetic algorithm	241
deepseek-ai/deepseek-moe	A large language model with improved efficiency and performance compared to similar models	1,024