sentencepiece

Text segmenter

An unsupervised text tokenizer that segments input text into subwords and detokenizes output based on a predefined vocabulary size.

Unsupervised text tokenizer for Neural Network-based text generation.

GitHub

10k stars
126 watching
1k forks
Language: C++
last commit: about 2 months ago
natural-language-processingneural-machine-translationword-segmentation

Related projects:

Repository Description Stars
huggingface/tokenizers A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. 9,156
languagemachines/ucto A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing 66
bigscience-workshop/promptsource A toolkit for creating and using natural language prompts to enable large language models to generalize to new tasks. 2,718
neurosnap/sentences A command line tool to split text into individual sentences 441
minimaxir/textgenrnn A Python module for creating character-level or word-level neural networks for text generation and training on various datasets 4,944
brightmart/text_classification An NLP project offering various text classification models and techniques for deep learning exploration 7,881
stanfordnlp/stanza A Python library for natural language processing tasks in many human languages. 7,315
stanfordnlp/glove Provides pre-trained word vector representations and an implementation of the GloVe model for learning word embeddings 6,908
huggingface/text-generation-inference A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation 9,456
princeton-nlp/simcse An open source framework for learning sentence embeddings using contrastive learning. 3,457
oxford-cs-deepnlp-2017/lectures An open-source repository containing lecture slides and course materials for an advanced natural language processing course. 15,702
lfcipriani/punkt-segmenter A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text 92
zencephalon/tactful_tokenizer A Ruby library that tokenizes text into sentences using a Bayesian statistical model 80
deepseek-ai/deepseek-coder A code completion model trained on large amounts of programming language data to help developers write code more efficiently. 6,987
karpathy/neuraltalk2 Efficient image captioning model using a CNN followed by an RNN in deep learning on GPU 5,515