ucto

Text tokenizer

A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as changing case

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can easily be extended to support other languages. It has been incorporated into Frog, our Dutch morpho-syntactic processor, to tokenise Dutch text. http://ilk.uvt.nl/ucto
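Below is a minimal sketch of what this looks like in practice, using the python-ucto binding (listed under related projects below). It assumes ucto and its English rules file tokconfig-eng are installed, and follows the Tokenizer/process API from the binding's documentation:

```python
# Minimal sketch, assuming ucto and python-ucto are installed along with
# the English configuration "tokconfig-eng" shipped with ucto.
import ucto

tokenizer = ucto.Tokenizer("tokconfig-eng")
tokenizer.process("Mr. Smith lives in the U.S.! Right?")

# Iterate over the tokens: punctuation is split off into separate tokens,
# and sentence boundaries are marked on the tokens themselves.
for token in tokenizer:
    print(str(token), end="")
    if token.isendofsentence():
        print()                  # ucto detected a sentence boundary
    elif not token.nospace():
        print(" ", end="")
```

The same tokenisation is available from the command line, along the lines of `ucto -L eng input.txt output.txt`, where -L selects the language-specific rules.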

GitHub

65 stars
13 watching
13 forks
Language: C++
Last commit: 16 days ago
Linked from 2 awesome lists

Tags: computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser

Related projects:

Repository | Description | Stars
proycon/python-ucto | A Python binding to an advanced, extensible tokeniser written in C++ | 29
c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications | 1
juliatext/wordtokenizers.jl | A set of high-performance tokenizers for natural language processing tasks | 96
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46
nytud/quntoken | A C++ tokenizer for Hungarian text | 14
thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules | 31
denosaurs/tokenizer | A simple tokenizer library for parsing and analyzing text input in various formats | 17
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28
zseder/huntoken | A shell-based tool for breaking down raw text into words and sentences | 3
neurosnap/sentences | A command-line tool to split text into individual sentences | 440
diasks2/pragmatic_tokenizer | A multilingual tokenizer to split strings into tokens, handling various language and formatting nuances | 90
lfcipriani/punkt-segmenter | Port of the NLTK Punkt sentence segmentation algorithm to Ruby | 92
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model | 80
zurawiki/tiktoken-rs | Provides a Rust library for tokenizing text with OpenAI models using tiktoken | 261
shonfeder/tokenize | A Prolog-based tokenization library for lexing text into common tokens | 11