minbpe

BPE algorithm

An implementation of the Byte Pair Encoding algorithm used in language model tokenization.

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

GitHub

9k stars
85 watching
867 forks
Language: Python
last commit: 5 months ago
Linked from 1 awesome list


Backlinks from these awesome lists:

Related projects:

Repository Description Stars
openai/tiktoken A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE) 12,420
huggingface/tokenizers A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. 9,051
aappleby/matcheroni A minimalist C++20 library for building pattern-matchers and parsers using Parsing Expression Grammars (PEGs) 198
ddbourgin/numpy-ml A collection of machine learning algorithms implemented in NumPy for rapid experimentation and prototyping. 15,466
karpathy/mingpt A minimal PyTorch implementation of a transformer-based language model 20,175
lfcipriani/punkt-segmenter Port of the NLTK Punkt sentence segmentation algorithm in Ruby 92
babel/minify A tool that uses Babel's compiler to achieve minification of modern JavaScript code by targeting only browsers that support newer ES features. 4,393
princeton-nlp/simcse An open source framework for learning sentence embeddings using contrastive learning. 3,434
bytedance/byteps A high-performance distributed deep learning framework supporting multiple frameworks and networks 3,630
javafxpert/llm-grovers-search-party An implementation of Grover's algorithm using Qiskit and a large language model to generate boolean expressions from narratives 10
p-ranav/alpaca A C++ serialization library that efficiently packs and unpacks structured data into compact byte arrays. 479
thunlp/plmpapers Compiles and organizes key papers on pre-trained language models, providing a resource for developers and researchers. 3,328
google/sentencepiece An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size. 10,303
haskell/binary Efficient serialisation of values to and from lazy ByteStrings in Haskell. 106
aksnzhy/xlearn A high-performance machine learning package with linear models and factorization machines. 3,087