jieba-php
Chinese tokenizer
A PHP module for Chinese text segmentation and word breaking
"結巴"中文分詞:做最好的 PHP 中文分詞、中文斷詞組件。 / "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP Chinese word segmentation module.
1k stars
56 watching
260 forks
Language: PHP
Last commit: over 2 years ago
Linked from 3 awesome lists
chinese-text-segmentation, machine-learning, natural-language-processing, nlp
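A minimal usage sketch, assuming the library is installed via Composer as fukuball/jieba-php; the Jieba::init(), Finalseg::init(), and Jieba::cut() calls follow the project's README, though the exact bootstrap may vary by version:

```php
<?php
// Sketch based on jieba-php's documented API (assumption: installed via
// `composer require fukuball/jieba-php:dev-master`).
ini_set('memory_limit', '1024M'); // the bundled dictionary is large

require_once __DIR__ . '/vendor/autoload.php';

use Fukuball\Jieba\Jieba;
use Fukuball\Jieba\Finalseg;

Jieba::init();    // load the default dictionary
Finalseg::init(); // load the HMM model used for words not in the dictionary

// Accurate mode: segment a sentence into a list of words
$words = Jieba::cut("怜香惜玉也得要看对象啊!");
var_dump($words);
```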
Related projects:
Repository | Description | Stars |
---|---|---|
mimosa/jieba-jruby | A Ruby port of the popular Chinese text segmentation library Jieba | 8 |
452896915/jieba-android | An Android implementation of the Chinese word segmentation algorithm jieba, optimized for fast initialization and tokenization | 152 |
fangpenlin/loso | An implementation of a Chinese segmentation system using the Hidden Markov Model algorithm | 83 |
xujiajun/gotokenizer | A tokenizer for Chinese text segmentation based on a dictionary and bigram language models | 21 |
6/tiny_segmenter | A Ruby port of the TinySegmenter Japanese text tokenization algorithm | 21 |
hit-scir/chinese-mixtral-8x7b | An implementation of a large language model for Chinese text processing, built on a Mixture-of-Experts (MoE) architecture with an expanded Chinese vocabulary | 641 |
lichunqiang/wordcolor.php | A PHP class that generates color codes based on words | 1 |
duanhongyi/genius | A Python library implementing a Conditional Random Field (CRF) based segmenter for Chinese text processing | 234 |
c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to easily integrate it into their applications. | 1 |
sinovation/zen | A pre-trained BERT-based Chinese text encoder with enhanced N-gram representations | 643 |
jiahuadong/fiss | Implementations of federated incremental semantic segmentation in PyTorch. | 33 |
jonsafari/tok-tok | A fast and simple tokenizer for multiple languages | 28 |
cebe/markdown | A fast and extensible Markdown parser for PHP | 999 |
wangwang4git/sqlite3-icu | A C implementation of a Chinese tokenizer for SQLite3 built on ICU's word-boundary analysis. | 6 |
arleyguolei/wx-words-pk | A set of tools and components for building Chinese input methods, focusing on character prediction and suggestion algorithms. | 886 |