minbpe
BPE algorithm
An implementation of the Byte Pair Encoding algorithm used in language model tokenization.
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
9k stars
85 watching
867 forks
Language: Python
last commit: 5 months ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) | 12,420 |
huggingface/tokenizers | A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. | 9,051 |
aappleby/matcheroni | A minimalist C++20 library for building pattern-matchers and parsers using Parsing Expression Grammars (PEGs) | 198 |
ddbourgin/numpy-ml | A collection of machine learning algorithms implemented in NumPy for rapid experimentation and prototyping. | 15,466 |
karpathy/mingpt | A minimal PyTorch implementation of a transformer-based language model | 20,175 |
lfcipriani/punkt-segmenter | Port of the NLTK Punkt sentence segmentation algorithm in Ruby | 92 |
babel/minify | A tool that uses Babel's compiler to achieve minification of modern JavaScript code by targeting only browsers that support newer ES features. | 4,393 |
princeton-nlp/simcse | An open source framework for learning sentence embeddings using contrastive learning. | 3,434 |
bytedance/byteps | A high-performance distributed deep learning framework supporting multiple frameworks and networks | 3,630 |
javafxpert/llm-grovers-search-party | An implementation of Grover's algorithm using Qiskit and a large language model to generate boolean expressions from narratives | 10 |
p-ranav/alpaca | A C++ serialization library that efficiently packs and unpacks structured data into compact byte arrays. | 479 |
thunlp/plmpapers | Compiles and organizes key papers on pre-trained language models, providing a resource for developers and researchers. | 3,328 |
google/sentencepiece | An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size. | 10,303 |
haskell/binary | Efficient serialisation of values to and from lazy ByteStrings in Haskell. | 106 |
aksnzhy/xlearn | A high-performance machine learning package with linear models and factorization machines. | 3,087 |