minbpe

BPE algorithm

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in large language model (LLM) tokenization.
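
For readers unfamiliar with the algorithm, the core idea of byte-level BPE can be sketched roughly as follows: start from the UTF-8 bytes of a text, repeatedly find the most frequent adjacent pair of tokens, and replace it with a new token id. This is an illustrative sketch only, not code taken from the minbpe repository; the function names (count_pairs, merge_pair, train_bpe, decode) and the choice to number new tokens from 256 are assumptions made for this example.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Hypothetical helper for illustration: new token ids start at 256,
    directly after the 256 possible byte values.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        # Most frequent pair; ties go to the pair seen first.
        pair = max(pairs, key=pairs.get)
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges into byte strings."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Dicts preserve insertion order, so earlier merges are defined first.
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    text = "aaabdaaabac"
    ids, merges = train_bpe(text, num_merges=3)
    print(ids)
    print(decode(ids, merges) == text)
```

On the short string above, three merges shrink the 11 input bytes to 5 tokens, and decoding recovers the original text. Production tokenizers typically add steps this sketch omits, such as splitting text with a regex before merging and handling special tokens.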

GitHub

9k stars

85 watching

871 forks

Language: Python

Last commit: about 1 year ago

Linked from 1 awesome list

Backlinks from these awesome lists:

amrzv/awesome-colab-notebooks

Related projects:

Repository	Description	Stars
openai/tiktoken	A fast and efficient tokeniser for natural language models based on Byte Pair Encoding (BPE)	12,703
huggingface/tokenizers	A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages.	9,156
aappleby/matcheroni	A minimalist C++20 library for building pattern-matchers and parsers using Parsing Expression Grammars (PEGs)	198
ddbourgin/numpy-ml	A collection of machine learning algorithms implemented in NumPy for rapid experimentation and prototyping.	15,789
karpathy/mingpt	A minimal PyTorch implementation of a transformer-based language model	20,474
lfcipriani/punkt-segmenter	A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text	92
babel/minify	A tool that uses Babel's compiler to achieve minification of modern JavaScript code by targeting only browsers that support newer ES features.	4,395
princeton-nlp/simcse	An open source framework for learning sentence embeddings using contrastive learning.	3,457
bytedance/byteps	A high-performance distributed deep learning framework supporting multiple frameworks and networks	3,635
javafxpert/llm-grovers-search-party	An implementation of Grover's algorithm using Qiskit and a large language model to generate boolean expressions from narratives	10
p-ranav/alpaca	A C++ serialization library that efficiently packs and unpacks structured data into compact byte arrays.	481
thunlp/plmpapers	Compiles and organizes key papers on pre-trained language models, providing a resource for developers and researchers.	3,331
google/sentencepiece	An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size.	10,366
haskell/binary	Efficient serialisation of values to and from lazy ByteStrings in Haskell.	109
aksnzhy/xlearn	A high-performance machine learning package with linear models and factorization machines.	3,087