python-ucto

Tokeniser library

A Python binding to an advanced, extensible tokeniser written in C++

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
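A minimal usage sketch is shown below. It assumes python-ucto and ucto itself are installed and that the standard English configuration file (tokconfig-eng) is available on the system; see the project README for the full Tokenizer API.

```python
import ucto

# Load a tokeniser with one of ucto's language configuration files.
# "tokconfig-eng" is the standard English configuration; the configurations
# actually available depend on your ucto installation.
tokenizer = ucto.Tokenizer("tokconfig-eng")

# Tokenise a piece of raw text.
tokenizer.process("Mr. Doe's cat can't jump 4.5 metres. Can yours?")

# Iterate over the resulting tokens; isendofsentence() marks the sentence
# boundaries ucto detected.
for token in tokenizer:
    print(str(token))
    if token.isendofsentence():
        print()
```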

GitHub

29 stars
4 watching
5 forks
Language: Cython
last commit: 2 months ago
Linked from 2 awesome lists

computational-linguistics, folia, nlp, nlp-library, python, text-processing, tokenizer

Related projects:

| Repository | Description | Stars |
|---|---|---|
| languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65 |
| c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications | 1 |
| jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
| proycon/python-frog | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
| arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46 |
| shonfeder/tokenize | A Prolog-based tokenization library for lexing text into common tokens | 11 |
| lfcipriani/punkt-segmenter | Port of the NLTK Punkt sentence segmentation algorithm to Ruby | 92 |
| proger/uk4b | Pretraining and finetuning techniques for language models using metadata-conditioned text generation | 18 |
| nytud/quntoken | A C++ tokenizer for Hungarian text | 14 |
| taocpp/pegtl | A header-only C++ library for creating parsers based on Parsing Expression Grammars | 1,945 |
| rkcosmos/deepcut | A Thai word tokenization library using a deep neural network | 420 |
| thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules | 31 |
| abitdodgy/words_counted | A Ruby library that tokenizes input and provides various statistical measures about the tokens | 159 |
| denosaurs/tokenizer | A simple tokenizer library for parsing and analyzing text input in various formats | 17 |
| pfalcon/pycopy-lib | A minimal and lightweight Python standard library compatible with other variants and implementations of Python | 248 |