python-ucto

Tokeniser library

A Python binding to an advanced, extensible tokeniser written in C++

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
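A minimal usage sketch is shown below. It assumes python-ucto and ucto itself are installed and that the standard English configuration file (tokconfig-eng) is available on the system; see the project README for the full Tokenizer API.

```python
import ucto

# Load a tokeniser with one of ucto's language configuration files.
# "tokconfig-eng" is the standard English configuration; the configurations
# actually available depend on your ucto installation.
tokenizer = ucto.Tokenizer("tokconfig-eng")

# Tokenise a piece of raw text.
tokenizer.process("Mr. Doe's cat can't jump 4.5 metres. Can yours?")

# Iterate over the resulting tokens; isendofsentence() marks the sentence
# boundaries ucto detected.
for token in tokenizer:
    print(str(token))
    if token.isendofsentence():
        print()
```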

GitHub

29 stars
4 watching
5 forks
Language: Cython
last commit: 2 months ago
Linked from 2 awesome lists

computational-linguistics, folia, nlp, nlp-library, python, text-processing, tokenizer

Related projects:

| Repository | Description | Stars |
|---|---|---|
| languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing | 65 |
| c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications | 1 |
| jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
| proycon/python-frog | A Python binding to a C++ NLP tool for Dutch language processing tasks | 47 |
| arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks | 46 |
| shonfeder/tokenize | A Prolog-based tokenization library for lexing text into common tokens | 11 |
| lfcipriani/punkt-segmenter | Port of the NLTK Punkt sentence segmentation algorithm to Ruby | 92 |
| proger/uk4b | Pretraining and finetuning techniques for language models using metadata-conditioned text generation | 18 |
| nytud/quntoken | A C++ tokenizer for Hungarian text | 14 |
| taocpp/pegtl | A header-only C++ library for creating parsers based on Parsing Expression Grammars | 1,945 |
| rkcosmos/deepcut | A Thai word tokenization library using a deep neural network | 420 |
| thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules | 31 |
| abitdodgy/words_counted | A Ruby library that tokenizes input and provides various statistical measures about the tokens | 159 |
| denosaurs/tokenizer | A simple tokenizer library for parsing and analyzing text input in various formats | 17 |
| pfalcon/pycopy-lib | A minimal and lightweight Python standard library compatible with other variants and implementations of Python | 248 |