tokenizers
Tokenizer toolkit
A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages.
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Stars: 9k
Watching: 122
Forks: 815
Language: Rust
Last commit: 3 months ago
Linked from 3 awesome lists
Topics: bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers
Related projects:
| Repository | Description | Stars |
|---|---|---|
| | A fast and efficient tokenizer for natural language models based on Byte Pair Encoding (BPE) | 12,703 |
| | An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size | 10,366 |
| | A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation | 9,456 |
| | A tool providing efficient data manipulation and loading for machine learning models | 19,349 |
| | An open-source JavaScript library for running machine learning models in the browser without a server | 12,363 |
| | A library designed to train transformer language models with reinforcement learning using various optimization techniques and fine-tuning methods | 10,308 |
| | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46 |
| | A tokenizer for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 66 |
| | An implementation of the Byte Pair Encoding algorithm used in language model tokenization | 9,253 |
| | A fast and simple tokenizer for multiple languages | 28 |
| | An efficient method for fine-tuning large pre-trained models by adapting only a small fraction of their parameters | 16,699 |
| | A collection of pre-trained machine learning models for various natural language and computer vision tasks, enabling developers to fine-tune and deploy these models on their own projects | 136,357 |
| | A Python binding to an advanced, extensible tokenizer written in C++ | 29 |
| | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
| | Provides recipes and guidelines for training language models to align with human preferences and AI goals | 4,800 |
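Several of the projects above are built around Byte Pair Encoding (BPE). As a rough illustration of the idea, here is a minimal, self-contained sketch of the BPE training loop: starting from characters, repeatedly merge the most frequent adjacent symbol pair. The function names and toy corpus are invented for this example; this is not the API of `tokenizers` or any listed library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every occurrence of `pair` as a single merged symbol."""
    merged = pair[0] + pair[1]
    out = []
    for symbols, freq in words:
        new_syms, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_syms.append(merged)  # apply the merge rule
                i += 2
            else:
                new_syms.append(symbols[i])
                i += 1
        out.append((new_syms, freq))
    return out

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} dict."""
    words = [(list(w), f) for w, f in corpus.items()]  # start from characters
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words
```

On a toy corpus such as `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, the first learned merges combine frequent suffix fragments like `e`+`s` and `es`+`t`. Production tokenizers add byte-level alphabets, pre-tokenization, and fast data structures on top of this core loop.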