dlrover

Distributed Training System

Automates large-scale deep learning training on distributed clusters, providing fault tolerance and fast recovery from failures.

DLRover: An Automatic Distributed Deep Learning System

1k stars
49 watching
168 forks
Language: Python
Last commit: about 1 month ago
Linked from 1 awesome list

Topics: distributed-training, hacktoberfest, k8s, llm-training

Related projects:

Repository | Description | Stars
wyy-123-xyy/ra-fed | A Python implementation of a distributed machine learning framework for training neural networks on multiple GPUs | 6
learning-at-home/hivemind | A PyTorch library for decentralized deep learning across the Internet | 2,078
open-mmlab/mmengine | Provides a flexible and configurable framework for training deep learning models with PyTorch | 1,196
nitrain/nitrain | A framework-agnostic Python library for training AI models on medical images | 1,865
madrylab/robustness | A library for training and evaluating neural networks with a focus on adversarial robustness | 921
aporia-ai/mlnotify | Automated notification system for machine learning model training | 343
tiger-ai-lab/uniir | Trains and evaluates a universal multimodal retrieval model to perform various information retrieval tasks | 114
loudinthecloud/dpwa | A distributed learning framework that enables peer-to-peer parameter averaging and asynchronous training of deep neural networks | 53
geek-ai/magent | A platform for multi-agent reinforcement learning research and development | 1,700
ardanlabs/training-ai | Provides training materials and tools for building machine learning applications | 72
doudar/smartspin2k | Turns spin bikes into smart trainers with automatic resistance control and online connectivity to cycling apps | 189
tdeboissiere/deeplearningimplementations | A collection of implementations of recent deep learning papers in Python | 1,814
google-deepmind/meltingpot | Assesses generalization of multi-agent reinforcement learning algorithms to novel social situations | 637
ahmedfgad/neuralgenetic | Trains artificial neural networks using the genetic algorithm | 241
deepseek-ai/deepseek-moe | A large language model with improved efficiency and performance compared to similar models | 1,024