ucto
Text tokenizer
A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case conversion.
Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that help make text suitable for further processing such as indexing, part-of-speech tagging, or machine translation (see the usage sketch below). Ucto comes with tokenisation rules for several languages and can easily be extended to suit others. It is used to tokenise Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
65 stars
13 watching
13 forks
Language: C++
Last commit: 4 days ago
Linked from 2 awesome lists
Tags: computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser
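Concretely, the preprocessing described above can be driven from Python through the proycon/python-ucto binding listed under related projects. The following is a minimal sketch, assuming ucto and python-ucto are installed and that the English rule set `tokconfig-eng` ships with your ucto installation; the calls shown (`process`, `isendofsentence`, `nospace`) follow the binding's documented usage, but treat this as illustrative rather than definitive:

```python
import ucto

# Load the English tokenisation rules (assumes tokconfig-eng is installed
# alongside ucto; other languages use their own tokconfig-* files).
tokenizer = ucto.Tokenizer("tokconfig-eng")

# Feed raw text; ucto separates words from punctuation and detects
# sentence boundaries. process() may be called multiple times.
tokenizer.process("Mr. Smith bought 3.5 kg of apples. He paid $4!")

# Iterate over the tokenised output; each token serialises via str().
for token in tokenizer:
    print(str(token), end="")
    if token.isendofsentence():
        print()            # sentence boundary detected by ucto
    elif not token.nospace():
        print(" ", end="")  # tokens remember whether a space followed them
```

The same tokenisation is available from the command line via the `ucto` binary, where (in current releases) `-L` selects the language rule set, e.g. `ucto -L eng input.txt output.txt`.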
Related projects:
| Repository | Description | Stars |
| --- | --- | --- |
| proycon/python-ucto | A Python binding to ucto, an advanced, extensible tokeniser written in C++ | 29 |
| c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications | 1 |
| juliatext/wordtokenizers.jl | A set of high-performance tokenizers for natural language processing tasks | 96 |
| arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46 |
| nytud/quntoken | A C++ tokenizer for Hungarian text | 14 |
| thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules | 31 |
| denosaurs/tokenizer | A simple tokenizer library for parsing and analyzing text input in various formats | 17 |
| jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
| zseder/huntoken | A tool for tokenizing raw text into words and sentences in multiple languages | 3 |
| neurosnap/sentences | A command-line tool to split text into individual sentences | 439 |
| diasks2/pragmatic_tokenizer | A multilingual tokenizer that splits strings into tokens, handling various language and formatting nuances | 90 |
| lfcipriani/punkt-segmenter | A Ruby implementation of the Punkt sentence boundary detection algorithm | 92 |
| zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
| zurawiki/tiktoken-rs | A Rust library for tokenizing text with OpenAI models using tiktoken | 256 |
| shonfeder/tokenize | A Prolog-based tokenization library for lexing text into common tokens | 11 |