SQLite3-ICU

Chinese tokenizer

A Chinese tokenizer for SQLite3, implemented in C using ICU's text analysis features.

SQLite3 ICU Tokenizer

GitHub

6 stars
2 watching
3 forks
Language: C
Last commit: over 9 years ago
Linked from 1 awesome list

Related projects:

| Repository | Description | Stars |
|---|---|---|
| illarionov/sqlite3-unicodesn | An extension that adds full-text search capabilities to SQLite with Snowball stemming. | 34 |
| iwongu/sqlite3pp | A C++ wrapper around the SQLite3 API to simplify its use in C++ applications. | 606 |
| xujiajun/gotokenizer | A tokenizer for Chinese text segmentation based on a dictionary and bigram language models. | 21 |
| benwebber/sqlite3-uuid | An extension for generating UUIDs in a SQLite database. | 48 |
| gorilla/css | A utility for parsing and breaking down CSS3 code into smaller components. | 87 |
| sillsdev/icu-dotnet | A C# wrapper for a subset of the ICU4C libraries, providing Unicode and globalization support. | 62 |
| c4n/pythonlexto | A Python wrapper around the Thai word segmenter LexTo, allowing developers to integrate it into their applications. | 1 |
| abiliojr/fts5-snowball | A Snowball stemmer tokenizer extension for FTS5 in SQLite. | 47 |
| glzhao89/auto_taos_cfg | Automates the generation of TDengine log, data, and configuration files. | 0 |
| frost/isn | Provides PostgreSQL type definitions and Ecto extensions for international standards in data storage. | 10 |
| haifengkao/sqlitesubstringsearch | A tokenizer that supports fast substring search with FTS (full-text search). | 83 |
| languagemachines/ucto | A tokeniser for natural language text that separates words from punctuation and supports basic preprocessing steps such as case changing. | 65 |
| wangfreexx/wangfreexx-tianruoocr-cl-paddle | An open-source OCR project using the PaddleOCR framework to recognize Chinese characters and text. | 1,337 |
| nytud/quntoken | A C++ tokenizer for Hungarian text. | 14 |
| goodsign/icu | A Cgo binding to a Unicode-based C library for detecting and converting text encodings. | 21 |