tokenizers

Tokenizer toolkit

A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages.

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

GitHub

9k stars
121 watching
802 forks
Language: Rust
Last commit: 6 days ago
Linked from 3 awesome lists

Topics: bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers

Related projects:

Repository | Description | Stars
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) | 12,420
google/sentencepiece | An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size | 10,284
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models | 9,106
huggingface/datasets | A tool providing efficient data manipulation and loading for machine learning models | 19,258
huggingface/transformers.js | An API for using pre-trained machine learning models in web browsers without the need for a server | 12,085
huggingface/trl | A library designed to train transformer language models with reinforcement learning using various optimization techniques and fine-tuning methods | 10,053
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65
karpathy/minbpe | An implementation of the Byte Pair Encoding algorithm used in language model tokenization | 9,185
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28
huggingface/peft | An efficient method for fine-tuning large pre-trained models by adapting only a small fraction of their parameters | 16,437
huggingface/transformers | A collection of pre-trained machine learning models for various natural language and computer vision tasks, enabling developers to fine-tune and deploy these models on their own projects | 135,022
proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++ | 29
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80
huggingface/alignment-handbook | Provides training recipes and resources to align language models with human preferences | 4,677