tokenizers

Tokenizer toolkit

A toolkit of optimized tokenizers for natural language processing, with bindings for several programming languages.

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

GitHub

9k stars

122 watching

815 forks

Language: Rust

Last commit: 8 months ago

Linked from 3 awesome lists

bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers

huggingface.co/docs/tokenizers

Related projects:

Repository	Description	Stars
openai/tiktoken	A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE)	12,703
google/sentencepiece	An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size	10,366
huggingface/text-generation-inference	A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation	9,456
huggingface/datasets	A tool providing efficient data manipulation and loading for machine learning models	19,349
huggingface/transformers.js	An open-source JavaScript library for running machine learning models in the browser without a server	12,363
huggingface/trl	A library for training transformer language models with reinforcement learning, using various optimization techniques and fine-tuning methods	10,308
arbox/tokenizer	A Ruby-based library for splitting written text into tokens for natural language processing tasks.	46
languagemachines/ucto	A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing	66
karpathy/minbpe	An implementation of the Byte Pair Encoding algorithm used in language model tokenization	9,253
jonsafari/tok-tok	A fast and simple tokenizer for multiple languages	28
huggingface/peft	An efficient method for fine-tuning large pre-trained models by adapting only a small fraction of their parameters	16,699
huggingface/transformers	A collection of pre-trained machine learning models for natural language and computer vision tasks that developers can fine-tune and deploy in their own projects	136,357
proycon/python-ucto	A Python binding to an advanced, extensible tokeniser written in C++	29
zencephalon/tactful_tokenizer	A Ruby library that tokenizes text into sentences using a Bayesian statistical model	80
huggingface/alignment-handbook	Provides recipes and guidelines for training language models to align with human preferences and AI goals	4,800