ucto

Text tokenizer

A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case conversion.

Ucto is a Unicode tokeniser. It tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can easily be extended to suit other languages. It has been incorporated for tokenising Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
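As a rough illustration of the three steps named above (separating words from punctuation, splitting sentences, and changing case), here is a minimal Python sketch. It is not Ucto's implementation or API: the `tokenize` function, the regular expression, and the sentence-ending rule are all assumptions made for this example, and Ucto itself relies on language-specific rule files.

```python
import re

# Hypothetical helper, not part of Ucto: a word is a run of word characters,
# and every other non-space character becomes its own punctuation token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Naive assumption: a sentence ends at ".", "!" or "?".
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of tokens."""
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token.lower() if lowercase else token)
        if token in SENTENCE_END:
            # Close the current sentence at sentence-final punctuation.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that lack sentence-final punctuation.
        sentences.append(current)
    return sentences
```

A rule this naive breaks on abbreviations such as "Mr. Smith", which it would split into two sentences. Cases like that are the reason a tokeniser needs per-language rules.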

GitHub

66 stars

13 watching

13 forks

Language: C++

last commit: 11 months ago

Linked from 2 awesome lists

computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser


languagemachines.github.io/ucto


Related projects:

Repository	Description	Stars
proycon/python-ucto	A Python binding to an advanced, extensible tokeniser written in C++	29
c4n/pythonlexto	A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications.	1
juliatext/wordtokenizers.jl	A set of high-performance tokenizers for natural language processing tasks	96
arbox/tokenizer	A Ruby-based library for splitting written text into tokens for natural language processing tasks.	46
nytud/quntoken	A C++ tokenizer that tokenizes Hungarian text	14
thisiscetin/textoken	A gem for extracting words from text with customizable tokenization rules	31
denosaurs/tokenizer	A simple tokenizer library for parsing and analyzing text input in various formats.	17
jonsafari/tok-tok	A fast and simple tokenizer for multiple languages	28
zseder/huntoken	A tool for tokenizing raw text into words and sentences in multiple languages, including Hungarian.	4
neurosnap/sentences	A command line tool to split text into individual sentences	441
diasks2/pragmatic_tokenizer	A multilingual tokenizer to split strings into tokens, handling various language and formatting nuances.	90
lfcipriani/punkt-segmenter	A Ruby port of the NLTK algorithm to detect sentence boundaries in unstructured text	92
zencephalon/tactful_tokenizer	A Ruby library that tokenizes text into sentences using a Bayesian statistical model	80
zurawiki/tiktoken-rs	Provides a Rust library for tokenizing text with OpenAI models using tiktoken.	266
shonfeder/tokenize	A Prolog-based tokenization library for lexing text into common tokens	11