sentencepiece
Text segmenter
An unsupervised text tokenizer and detokenizer that segments input text into subword units drawn from a vocabulary whose size is fixed before training.
Unsupervised text tokenizer for Neural Network-based text generation.
10k stars
127 watching
1k forks
Language: C++
last commit: 20 days ago
natural-language-processing, neural-machine-translation, word-segmentation
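To make the description above concrete, here is a minimal sketch of subword segmentation and lossless detokenization. It is a toy, not the library's method: the vocabulary `VOCAB` is hand-written rather than learned, and greedy longest-match stands in for the BPE and unigram models that sentencepiece actually trains. The function names `segment` and `detokenize` are hypothetical. The one detail taken from sentencepiece itself is the use of "▁" (U+2581) to mark whitespace, which is what lets the pieces be joined back into the original text.

```python
# Toy illustration only: greedy longest-match over a hand-written vocabulary.
# This is NOT the algorithm sentencepiece uses (it trains BPE or unigram models).

WS = "\u2581"  # the "▁" meta symbol that stands in for a space

# Hypothetical, hand-picked vocabulary; a real one is learned from data
# and capped at a size chosen before training.
VOCAB = {WS + "Hello", WS + "wor", "ld", "."}


def segment(text, vocab):
    """Split text into subword pieces, falling back to single characters."""
    s = WS + text.replace(" ", WS)  # mark word boundaries explicitly
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Join pieces, drop the leading marker, and restore spaces."""
    return "".join(pieces)[1:].replace(WS, " ")


pieces = segment("Hello world.", VOCAB)
restored = detokenize(pieces)
print(pieces)
print(restored)
```

Because whitespace is carried inside the pieces, detokenization needs no language-specific rules: concatenating the pieces and mapping the marker back to a space recovers the input.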
Related projects:
Repository | Description | Stars |
---|---|---|
huggingface/tokenizers | A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. | 9,051 |
languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65 |
bigscience-workshop/promptsource | A toolkit for creating and using natural language prompts to enable large language models to generalize to new tasks. | 2,696 |
neurosnap/sentences | A command line tool to split text into individual sentences | 439 |
minimaxir/textgenrnn | A Python module for creating character-level or word-level neural networks for text generation and training on various datasets | 4,943 |
brightmart/text_classification | An NLP project offering various text classification models and techniques for deep learning exploration | 7,861 |
stanfordnlp/stanza | A Python library for natural language processing tasks in many human languages. | 7,294 |
stanfordnlp/glove | Provides pre-trained word vector representations and an implementation of the GloVe model for learning word embeddings | 6,885 |
huggingface/text-generation-inference | A toolkit for deploying and serving Large Language Models. | 9,106 |
princeton-nlp/simcse | An open source framework for learning sentence embeddings using contrastive learning. | 3,423 |
oxford-cs-deepnlp-2017/lectures | An open-source repository containing lecture slides and course materials for an advanced natural language processing course. | 15,683 |
lfcipriani/punkt-segmenter | An implementation of a sentence boundary detection algorithm in Ruby. | 92 |
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
deepseek-ai/deepseek-coder | A code completion model trained on large amounts of programming language data to help developers write code more efficiently. | 6,837 |
karpathy/neuraltalk2 | An efficient image captioning model that uses a CNN followed by an RNN and runs on GPU | 5,511 |