gotokenizer

Chinese Tokenizer Library

A tokenizer for Go based on a dictionary and a bigram language model. It currently supports only Chinese text segmentation.

GitHub

21 stars
3 watching
7 forks
Language: Go
Last commit: over 5 years ago
Linked from 2 awesome lists

golang segmentation tokenizer

Related projects:

Repository | Description | Stars
--- | --- | ---
bzick/tokenizer | A high-performance tokenization library for Go, capable of parsing various data formats and syntaxes. | 98
fangpenlin/loso | An implementation of a Chinese segmentation system using the Hidden Markov Model algorithm. | 83
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages. | 28
thisiscetin/textoken | A gem for extracting words from text with customizable tokenization rules. | 31
xujiajun/gorouter | A fast and feature-rich HTTP router for Go that supports regular expressions. | 533
fukuball/jieba-php | A PHP module for Chinese text segmentation and word breaking. | 1,323
mimosa/jieba-jruby | A Ruby port of Jieba, the popular Chinese language processing library. | 8
zencephalon/tactful_tokenizer | A Ruby library that tokenizes text into sentences using a Bayesian statistical model. | 80
xujiajun/pattern-guidance | A comprehensive guide to design patterns in the Go programming language. | 268
diasks2/pragmatic_tokenizer | A multilingual tokenizer that splits strings into tokens, handling various language and formatting nuances. | 90
6/tiny_segmenter | A Ruby port of a Japanese text tokenization algorithm. | 21
abitdodgy/words_counted | A Ruby library that tokenizes input and provides various statistical measures about the tokens. | 159
zseder/huntoken | A tool for tokenizing raw text into words and sentences in multiple languages. | 3
arbox/tokenizer | A Ruby-based library for splitting written text into tokens for natural language processing tasks. | 46
tiancaiamao/shen-go | A Go implementation of Shen, a portable functional programming language with features like pattern matching and macro support. | 55