toxy

Document extractor

A .NET framework for extracting text from various document formats across multiple platforms.

.NET text extraction framework


362 stars
39 watching
107 forks
Language: C#
Last commit: 4 months ago
Linked from 2 awesome lists


Related projects:

Repository | Description | Stars
felipecsl/wombat | A Ruby-based web crawler and data extraction tool with an elegant DSL. | 1,315
ckorzen/pdf-text-extraction-benchmark | Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles. | 65
dbuenzli/uuseg | An OCaml library for segmenting Unicode text into grapheme clusters, words, and sentences. | 23
eyurtsev/kor | An open-source wrapper around LLMs to extract structured data from text. | 1,638
nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach. | 42
xyntopia/pydoxtools | A Python library for extracting information from unstructured documents using AI techniques and customizable pipelines. | 78
meilisearch/docs-scraper | Automates scraping and indexing of documentation content into a search engine. | 297
sillsdev/standardformatlib | A C# library for reading and writing files using standard format markers. | 0
s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 268
jjelosua/doga_scraper | A tool that extracts and converts Galician Official Journal documents to different formats based on input year. | 0
sinairv/yaxlib | A flexible XML serialization library for .NET Framework and .NET Core. | 0
feichao93/temme | A lightweight, CSS-based selector for extracting structured data from HTML documents. | 273
fielddb/multilingualcorporaextractor | Extracts and formats multilingual corpora from international bibles into XML, JSON, and HTML files for analysis. | 0
tjatse/node-readability | Automates web page scraping and text extraction to make any webpage readable. | 343
aymericbeaumet/squeeze | A tool to extract relevant information from text. | 17