webcorpus

Text processor

A collection of scripts and programs for processing crawled data into a usable text corpus.



8 stars
4 watching
0 forks
Language: C++
Last commit: over 9 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| ericzimmerman/bstrings | A utility for searching and processing strings in various formats and encodings | 121 |
| senselogic/pendown | A text-to-HTML conversion tool with integrated styling and tag customization | 49 |
| wooorm/dioscuri | A tool for parsing and transforming text formats used in online communication | 41 |
| kzykhys/text | A simple text manipulation library with a fluent interface | 53 |
| eliaskosunen/scnlib | A modern C++ library for safer and more efficient input parsing | 1,098 |
| esemplastic/unis | A common architecture for string utilities in the Go programming language | 70 |
| nysol/mcmd | A set of commands for high-speed processing of large-scale CSV data | 33 |
| gagolews/stringi | A package providing a fast and portable way to process character strings with Unicode support | 306 |
| spreads/spreads | A high-performance library for real-time data processing and time series manipulation | 430 |
| zepgram/module-multi-threading | A module that enables parallel processing of large data sets in Magento 2 using multiple child processes | 80 |
| ziglibs/fontaine | A text rendering library providing basic font layouting and glyph information for rendering text in arbitrary contexts | 34 |
| cpitclaudel/alectryon | A tool for processing Coq and Lean 4 code embedded in text documents | 237 |
| zix99/rare | A tool that provides fast and efficient text analysis and visualization capabilities | 275 |
| semiversus/python-broqer | A reactive data processing library with publish-subscribe functionality and asyncio support | 74 |
| ezrosent/frawk | A small programming language for processing textual data with improved performance compared to AWK | 1,256 |