tokenizers
Tokenizer toolkit
A toolkit of optimized tokenizers for natural language processing, implemented in Rust with bindings for Python and Node.js.
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
9k stars · 122 watching · 815 forks · Language: Rust · Last commit: about 2 months ago · Linked from 3 awesome lists
Tags: bert, gpt, language-model, natural-language-processing, natural-language-understanding, nlp, transformers
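
The library is most often driven through its Python bindings. The sketch below is a minimal, non-authoritative example of the usual workflow; it assumes the `tokenizers` package is installed (`pip install tokenizers`) and that a plain-text file named `corpus.txt` (a placeholder name) exists locally.

```python
# Minimal sketch using the Python bindings (pip install tokenizers).
# `corpus.txt` is an illustrative file name, not part of the library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a byte-pair-encoding tokenizer with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn the vocabulary and merge rules from the local corpus.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a sentence and inspect the resulting subword tokens and ids.
encoding = tokenizer.encode("Fast state-of-the-art tokenizers, optimized for production.")
print(encoding.tokens)
print(encoding.ids)

# Persist the whole pipeline to a single JSON file.
tokenizer.save("tokenizer.json")
```

Saving to one JSON file keeps the full pipeline (pre-tokenizer, model, special tokens) together, so it can later be reloaded with `Tokenizer.from_file("tokenizer.json")`.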
Related projects:
Repository | Description | Stars |
---|---|---|
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE). | 12,703 |
google/sentencepiece | An unsupervised text tokenizer that segments input text into subwords, with the vocabulary size fixed in advance, and detokenizes model output back into raw text. | 10,366 |
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation. | 9,456 |
huggingface/datasets | A library providing efficient data loading and manipulation for machine learning models. | 19,349 |
huggingface/transformers.js | An open-source JavaScript library for running machine learning models in the browser without a server. | 12,363 |
huggingface/trl | A library designed to train transformer language models with reinforcement learning using various optimization techniques and fine-tuning methods. | 10,308 |
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks. | 46 |
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing. | 66 |
karpathy/minbpe | A minimal implementation of the Byte Pair Encoding (BPE) algorithm used in language model tokenization (see the BPE sketch after this table). | 9,253 |
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages. | 28 |
huggingface/peft | A library of parameter-efficient fine-tuning methods that adapt large pre-trained models by updating only a small fraction of their parameters. | 16,699 |
huggingface/transformers | A collection of pre-trained machine learning models for various natural language and computer vision tasks, enabling developers to fine-tune and deploy these models in their own projects. | 136,357 |
proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++. | 29 |
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
huggingface/alignment-handbook | Provides robust recipes and guidelines for aligning language models with human and AI preferences. | 4,800 |
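
Several of the projects above (openai/tiktoken, karpathy/minbpe, and to a degree google/sentencepiece) center on Byte Pair Encoding. As a rough illustration of the idea only, not code taken from any of these repositories, the sketch below learns byte-level BPE merges by repeatedly replacing the most frequent adjacent pair of token ids; all function names are hypothetical.

```python
# Illustrative byte-level BPE training loop; function names are hypothetical.
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step                 # new token id beyond the raw byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

# Each learned rule maps a pair of existing token ids to a new, longer token.
print(train_bpe("low lower lowest newer newest", 10))
```

Production tokenizers add vocabulary caps, pre-tokenization, special tokens, and fast data structures on top of this core loop, but the merge logic itself stays essentially the same.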