tokenizers
Tokenizer toolkit
A toolkit of optimized tokenizers for natural language processing, with bindings for several programming languages.
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
9k stars
121 watching
802 forks
Language: Rust
last commit: 6 days ago
Linked from 3 awesome lists
Tags: bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers
Related projects:
Repository | Description | Stars |
---|---|---|
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) | 12,420 |
google/sentencepiece | An unsupervised text tokenizer and detokenizer that segments input text into subwords using a predefined vocabulary size. | 10,284 |
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models. | 9,106 |
huggingface/datasets | A tool providing efficient data manipulation and loading for machine learning models | 19,258 |
huggingface/transformers.js | An API for using pre-trained machine learning models in web browsers without the need for a server | 12,085 |
huggingface/trl | A library for training transformer language models with reinforcement learning and related fine-tuning methods. | 10,053 |
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks. | 46 |
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65 |
karpathy/minbpe | An implementation of the Byte Pair Encoding algorithm used in language model tokenization. | 9,185 |
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
huggingface/peft | An efficient method for fine-tuning large pre-trained models by adapting only a small fraction of their parameters | 16,437 |
huggingface/transformers | A collection of pre-trained machine learning models for natural language and computer vision tasks that developers can fine-tune and deploy in their own projects. | 135,022 |
proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++ | 29 |
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
huggingface/alignment-handbook | Provides training recipes and resources to align language models with human preferences | 4,677 |