sentencepiece

Text segmenter

An unsupervised text tokenizer and detokenizer that segments input text into subword units, with the vocabulary size fixed before training.
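To make the segment-then-detokenize round trip concrete, here is a minimal sketch in plain Python. It is a toy, not the library's API or its training algorithm: the vocabulary is hand-written rather than learned, the greedy longest-match rule is a stand-in, and the names `VOCAB`, `segment`, and `detokenize` are hypothetical. The one detail taken from SentencePiece itself is the use of "▁" (U+2581) to mark whitespace, which is what lets the pieces be joined back into the original text.

```python
# Toy illustration only: a hand-written vocabulary and greedy longest-match
# segmentation. The real library learns its vocabulary from raw text.

WS = "\u2581"  # "▁", the marker SentencePiece uses for whitespace

# Hypothetical vocabulary; the single characters act as a fallback.
VOCAB = {WS + "hello", WS + "wor", "ld", WS, "h", "e", "l", "o", "w", "r", "d"}


def segment(text, vocab=VOCAB):
    """Split text into subword pieces by greedy longest match."""
    # Escape spaces so that whitespace survives as part of the pieces.
    escaped = WS + text.replace(" ", WS)
    pieces = []
    i = 0
    while i < len(escaped):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab:
                pieces.append(escaped[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit the character as its own piece.
            pieces.append(escaped[i])
            i += 1
    return pieces


def detokenize(pieces):
    """Join pieces, restore spaces, and drop the leading marker's space."""
    return "".join(pieces).replace(WS, " ")[1:]


pieces = segment("hello world")
print(pieces)
print(detokenize(pieces))
```

Because whitespace is carried inside the pieces, detokenization needs no language-specific rules: joining the pieces and replacing the marker is enough to recover the input in this sketch.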

Unsupervised text tokenizer for Neural Network-based text generation.

GitHub

10k stars
127 watching
1k forks
Language: C++
Last commit: 20 days ago
Topics: natural-language-processing, neural-machine-translation, word-segmentation

Related projects:

Repository | Description | Stars
huggingface/tokenizers | A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. | 9,051
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65
bigscience-workshop/promptsource | A toolkit for creating and using natural language prompts to enable large language models to generalize to new tasks. | 2,696
neurosnap/sentences | A command line tool to split text into individual sentences | 439
minimaxir/textgenrnn | A Python module for creating character-level or word-level neural networks for text generation and training on various datasets | 4,943
brightmart/text_classification | An NLP project offering various text classification models and techniques for deep learning exploration | 7,861
stanfordnlp/stanza | A Python library for natural language processing tasks in many human languages. | 7,294
stanfordnlp/glove | Provides pre-trained word vector representations and an implementation of the GloVe model for learning word embeddings | 6,885
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models. | 9,106
princeton-nlp/simcse | An open source framework for learning sentence embeddings using contrastive learning. | 3,423
oxford-cs-deepnlp-2017/lectures | An open-source repository containing lecture slides and course materials for an advanced natural language processing course. | 15,683
lfcipriani/punkt-segmenter | An implementation of a sentence boundary detection algorithm in Ruby. | 92
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80
deepseek-ai/deepseek-coder | A code completion model trained on large amounts of programming language data to help developers write code more efficiently. | 6,837
karpathy/neuraltalk2 | Efficient image captioning model using a CNN followed by an RNN in deep learning on GPU | 5,511