TensorRT-LLM
Inference optimizer
A software framework providing an easy-to-use Python API to optimize Large Language Models on NVIDIA GPUs for efficient inference.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
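The Python API described above can be sketched with TensorRT-LLM's high-level `LLM` entry point. This is a minimal illustration based on the project's quick-start pattern, not a tested example: it requires an NVIDIA GPU and the `tensorrt_llm` package, and the model name shown is only a placeholder assumption.

```python
# Sketch of the TensorRT-LLM high-level Python API (requires an NVIDIA GPU
# and the tensorrt_llm package; model name is a placeholder assumption).
from tensorrt_llm import LLM, SamplingParams

# Loading a Hugging Face model builds an optimized TensorRT engine under the hood.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sampling parameters control generation (temperature, output length, etc.).
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions; each result carries the generated text.
for output in llm.generate(["What does TensorRT-LLM optimize?"], params):
    print(output.outputs[0].text)
```

The engine build happens on first load, after which the same runtime components mentioned above execute the compiled engine for inference.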
9k stars
95 watching
1k forks
Language: C++
last commit: 3 months ago
Linked from 1 awesome list
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A tool for optimizing large language models by collecting feedback and metrics to improve their performance over time. | 1,245 |
| | Provides a set of tools and libraries for optimizing deep learning inference on NVIDIA GPUs. | 10,926 |
| | A machine learning compiler and deployment engine for large language models. | 19,396 |
| | A high-performance transformer-based NLP component optimized for GPU acceleration and integration into various frameworks. | 5,937 |
| | A deep learning optimization library that simplifies distributed training and inference on modern computing hardware. | 35,863 |
| | A collection of optimized kernels and post-training loss functions for large language models. | 3,840 |
| | A toolkit for optimizing and serving large language models. | 4,854 |
| | An efficient Large Language Model inference engine leveraging consumer-grade GPUs on PCs. | 8,011 |
| | A toolbox implementing a sparse and low-rank tensor regression algorithm with boosting. | 12 |
| | An integrated development platform for large language models (LLMs) that provides observability, analytics, and management tools. | 7,123 |
| | A fast serving framework for large language models and vision language models. | 6,551 |
| | A deep learning and reinforcement learning library that provides an extensive collection of customizable neural layers to build advanced AI models quickly. | 7,337 |
| | Tools and techniques for optimizing large language models on various frameworks and hardware platforms. | 2,257 |
| | A Python-based framework for serving large language models with low latency and high scalability. | 2,691 |
| | Builds composable LLM applications with Java. | 295 |