ucto

Text tokenizer

A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case conversion

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits text into sentences. It also offers several other basic preprocessing steps, such as case conversion, that help prepare text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can easily be extended to others. It is used to tokenise Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
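The related projects below include proycon/python-ucto, a Python binding to ucto. As a quick illustration, here is a minimal sketch of tokenising English text through that binding, following the API described in its README; it assumes python-ucto and ucto's English rule set (tokconfig-eng) are installed, so treat it as a sketch rather than a definitive example:

```python
import ucto

# Initialise the tokeniser with a language-specific configuration;
# "tokconfig-eng" is ucto's English rule set (assumed to be installed).
tokenizer = ucto.Tokenizer("tokconfig-eng")

# Feed raw text; process() may be called repeatedly on successive chunks.
tokenizer.process("Mr. Smith arrived at 9 a.m. He didn't stay long!")

# Iterate over the resulting tokens; ucto also marks sentence boundaries.
for token in tokenizer:
    print(str(token), end=" ")
    if token.isendofsentence():
        print()  # newline at each detected sentence boundary
```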

GitHub

66 stars
13 watching
13 forks
Language: C++
Last commit: about 1 month ago
Linked from 2 awesome lists

computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser

Related projects:

| Repository | Description | Stars |
|---|---|---|
| proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++ | 29 |
| c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications | 1 |
| juliatext/wordtokenizers.jl | A set of high-performance tokenizers for natural language processing tasks | 96 |
| arbox/tokenizer | A Ruby library for splitting written text into tokens for natural language processing tasks | 46 |
| nytud/quntoken | A C++ tokenizer for Hungarian text | 14 |
| thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules | 31 |
| denosaurs/tokenizer | A simple tokenizer library for parsing and analyzing text input in various formats | 17 |
| jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
| zseder/huntoken | A tool for tokenizing raw text into words and sentences in multiple languages, including Hungarian | 4 |
| neurosnap/sentences | A command-line tool to split text into individual sentences | 441 |
| diasks2/pragmatic_tokenizer | A multilingual tokenizer that splits strings into tokens, handling various language and formatting nuances | 90 |
| lfcipriani/punkt-segmenter | A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text | 92 |
| zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80 |
| zurawiki/tiktoken-rs | A Rust library for tokenizing text with OpenAI models using tiktoken | 266 |
| shonfeder/tokenize | A Prolog-based tokenization library for lexing text into common tokens | 11 |