sentencepiece
Text segmenter
An unsupervised text tokenizer and detokenizer that segments input text into subword units, with the vocabulary size set in advance.
Unsupervised text tokenizer for Neural Network-based text generation.
10k stars
126 watching
1k forks
Language: C++
last commit: 3 months ago
Tags: natural-language-processing, neural-machine-translation, word-segmentation
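To make the description above concrete, here is a minimal, standard-library-only sketch of what subword segmentation and lossless detokenization can look like. It is a toy and not the project's implementation: the hand-written `VOCAB`, the greedy longest-match rule, and the function names `segment` and `detokenize` are all assumptions made for illustration, whereas the real library learns its vocabulary from raw text. The one detail borrowed from the project is the convention of escaping whitespace with the "▁" (U+2581) meta symbol so that the original text can be recovered from the pieces.

```python
# Toy illustration only: greedy longest-match segmentation over a
# hand-written vocabulary. This is NOT how the real library picks pieces.

WS = "\u2581"  # "▁", the meta symbol that stands in for a space

# Hypothetical vocabulary, chosen by hand for this example.
VOCAB = {WS + "Hello", WS + "wor", "ld", WS, "."}


def segment(text, vocab=VOCAB):
    """Split text into subword pieces by greedy longest match."""
    # Escape spaces and add a leading marker so every word starts with WS.
    s = WS + text.replace(" ", WS)
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Invert segment(): join pieces, restore spaces, drop the added prefix."""
    return "".join(pieces).replace(WS, " ")[1:]


pieces = segment("Hello world.")
print(pieces)
print(detokenize(pieces))
```

Because whitespace is carried inside the pieces themselves, detokenization needs no language-specific rules: joining the pieces and mapping the marker back to a space is enough, even for text that falls back to single characters.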
Related projects:
Repository | Description | Stars |
---|---|---|
| A toolkit providing optimized tokenizers for natural language processing tasks in various programming languages. | 9,156 |
| A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 66 |
| A toolkit for creating and using natural language prompts to enable large language models to generalize to new tasks. | 2,718 |
| A command line tool to split text into individual sentences | 441 |
| A Python module for creating character-level or word-level neural networks for text generation and training on various datasets | 4,944 |
| An NLP project offering various text classification models and techniques for deep learning exploration | 7,881 |
| A Python library for natural language processing tasks in many human languages. | 7,315 |
| Provides pre-trained word vector representations and an implementation of the GloVe model for learning word embeddings | 6,908 |
| A toolkit for deploying and serving Large Language Models (LLMs) for high-performance text generation | 9,456 |
| An open source framework for learning sentence embeddings using contrastive learning. | 3,457 |
| An open-source repository containing lecture slides and course materials for an advanced natural language processing course. | 15,702 |
| A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text | 92 |
| A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
| A code completion model trained on large amounts of programming language data to help developers write code more efficiently. | 6,987 |
| Efficient image captioning model using a CNN followed by an RNN in deep learning on GPU | 5,515 |