minbpe

BPE algorithm

An implementation of the Byte Pair Encoding algorithm used in language model tokenization.

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
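
As a rough illustration of what a byte-level BPE trainer does, here is a minimal sketch. It is not minbpe's actual code or API. The names `count_pairs`, `merge`, `train_bpe`, and `decode` are hypothetical, and the sketch assumes training starts from raw UTF-8 bytes (ids 0 to 255), with each merge assigned the next free id.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text` (hypothetical helper)."""
    ids = list(text.encode("utf-8"))  # base tokens are the byte values 0..255
    merges = {}                       # (left_id, right_id) -> new_id, in merge order
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:                 # fewer than two tokens left, nothing to merge
            break
        pair = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first one seen
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Rebuild the text by expanding every merged id back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts preserve merge order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)                   # compressed token ids
print(decode(ids, merges))   # round-trips to the original string
```

Each merge shortens the sequence by replacing the most frequent adjacent pair with a new token, so the vocabulary grows by one entry per step while the encoded text gets shorter.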

GitHub

9k stars
85 watching
871 forks
Language: Python
Last commit: 8 months ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
--- | --- | ---
openai/tiktoken | A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) | 12,703
huggingface/tokenizers | A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. | 9,156
aappleby/matcheroni | A minimalist C++20 library for building pattern-matchers and parsers using Parsing Expression Grammars (PEGs) | 198
ddbourgin/numpy-ml | A collection of machine learning algorithms implemented in NumPy for rapid experimentation and prototyping. | 15,789
karpathy/mingpt | A minimal PyTorch implementation of a transformer-based language model | 20,474
lfcipriani/punkt-segmenter | A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text | 92
babel/minify | A tool that uses Babel's compiler to achieve minification of modern JavaScript code by targeting only browsers that support newer ES features. | 4,395
princeton-nlp/simcse | An open source framework for learning sentence embeddings using contrastive learning. | 3,457
bytedance/byteps | A high-performance distributed deep learning framework supporting multiple frameworks and networks | 3,635
javafxpert/llm-grovers-search-party | An implementation of Grover's algorithm using Qiskit and a large language model to generate boolean expressions from narratives | 10
p-ranav/alpaca | A C++ serialization library that efficiently packs and unpacks structured data into compact byte arrays. | 481
thunlp/plmpapers | Compiles and organizes key papers on pre-trained language models, providing a resource for developers and researchers. | 3,331
google/sentencepiece | An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size. | 10,366
haskell/binary | Efficient serialisation of values to and from lazy ByteStrings in Haskell. | 109
aksnzhy/xlearn | A high-performance machine learning package with linear models and factorization machines. | 3,087