TensorRT-LLM

Inference optimizer

A software framework providing an easy-to-use Python API to optimize Large Language Models on NVIDIA GPUs for efficient inference.

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
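To illustrate that workflow, here is a minimal sketch using the high-level `LLM` API found in recent TensorRT-LLM releases. This assumes the `tensorrt_llm` package is installed and an NVIDIA GPU is available; the model name is illustrative, not prescribed by this page.

```python
# Minimal sketch of the high-level TensorRT-LLM Python API.
# Assumes tensorrt_llm is installed and an NVIDIA GPU is present;
# the model name below is an illustrative placeholder.
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM compiles the model into an optimized
# TensorRT engine on first use.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Decoding settings for generation.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Run inference on the compiled engine and print the generated text.
for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```

The engine build is the expensive step; once built, the same engine can be reused by the Python or C++ runtimes mentioned above.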

GitHub

9k stars
95 watching
1k forks
Language: C++
Last commit: about 1 month ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| tensorzero/tensorzero | A tool for optimizing large language models by collecting feedback and metrics to improve their performance over time. | 1,245 |
| nvidia/tensorrt | Provides a set of tools and libraries for optimizing deep learning inference on NVIDIA GPUs. | 10,926 |
| mlc-ai/mlc-llm | A machine learning compiler and deployment engine for large language models. | 19,396 |
| nvidia/fastertransformer | A high-performance transformer-based NLP component optimized for GPU acceleration and integration into various frameworks. | 5,937 |
| microsoft/deepspeed | A deep learning optimization library that simplifies distributed training and inference on modern computing hardware. | 35,863 |
| linkedin/liger-kernel | A collection of optimized kernels and post-training loss functions for large language models. | 3,840 |
| internlm/lmdeploy | A toolkit for optimizing and serving large language models. | 4,854 |
| sjtu-ipads/powerinfer | An efficient Large Language Model inference engine leveraging consumer-grade GPUs on PCs. | 8,011 |
| lifanghe/neurips18_surf | A toolbox implementing a sparse and low-rank tensor regression algorithm with boosting. | 12 |
| langfuse/langfuse | An integrated development platform for large language models (LLMs) that provides observability, analytics, and management tools. | 7,123 |
| sgl-project/sglang | A fast serving framework for large language models and vision language models. | 6,551 |
| tensorlayer/tensorlayer | A deep learning and reinforcement learning library that provides an extensive collection of customizable neural layers to build advanced AI models quickly. | 7,337 |
| intel/neural-compressor | Tools and techniques for optimizing large language models on various frameworks and hardware platforms. | 2,257 |
| modeltc/lightllm | A Python-based framework for serving large language models with low latency and high scalability. | 2,691 |
| li2109/langtorch | Builds composable LLM applications with Java. | 295 |