tokenizers

Tokenizer toolkit

A toolkit of optimized tokenizers for natural language processing tasks, usable from several programming languages.

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

GitHub

9k stars
122 watching
815 forks
Language: Rust
Last commit: about 2 months ago
Linked from 3 awesome lists

Topics: bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers

Related projects:

Repository | Description | Stars
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) | 12,703
google/sentencepiece | An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size | 10,366
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation | 9,456
huggingface/datasets | A tool providing efficient data manipulation and loading for machine learning models | 19,349
huggingface/transformers.js | An open-source JavaScript library for running machine learning models in the browser without a server | 12,363
huggingface/trl | A library designed to train transformer language models with reinforcement learning using various optimization techniques and fine-tuning methods | 10,308
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 66
karpathy/minbpe | An implementation of the Byte Pair Encoding algorithm used in language model tokenization | 9,253
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28
huggingface/peft | An efficient method for fine-tuning large pre-trained models by adapting only a small fraction of their parameters | 16,699
huggingface/transformers | A collection of pre-trained machine learning models for various natural language and computer vision tasks, enabling developers to fine-tune and deploy these models on their own projects | 136,357
proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++ | 29
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80
huggingface/alignment-handbook | Provides recipes and guidelines for training language models to align with human preferences and AI goals | 4,800