toxy

Document extractor

A .NET framework for extracting text from various document formats across multiple platforms.

.NET text extraction framework

GitHub

361 stars
39 watching
107 forks
Language: C#
Last commit: about 1 month ago
Linked from 2 awesome lists



Related projects:

| Repository | Description | Stars |
| --- | --- | --- |
| felipecsl/wombat | A Ruby-based web crawler and data extraction tool with an elegant DSL. | 1,315 |
| ckorzen/pdf-text-extraction-benchmark | Evaluates PDF extraction tools' ability to extract meaningful text from scientific articles. | 65 |
| dbuenzli/uuseg | An OCaml library for segmenting Unicode text into grapheme clusters, words, and sentences. | 23 |
| eyurtsev/kor | Extracts structured data from unstructured text using large language models. | 1,629 |
| nikolamilosevic86/tabinout | A framework for extracting information from tables in scientific literature using a rule-based approach. | 41 |
| xyntopia/pydoxtools | A Python library for extracting information from unstructured documents using AI techniques and customizable pipelines. | 77 |
| meilisearch/docs-scraper | Automates scraping and indexing of documentation content into a search engine. | 290 |
| sillsdev/standardformatlib | A C# library for reading and writing files using standard format markers. | 0 |
| s0rg/crawley | A utility for systematically extracting URLs from web pages and printing them to the console. | 265 |
| jjelosua/doga_scraper | A tool that extracts Galician Official Journal documents for a given input year and converts them to different formats. | 0 |
| sinairv/yaxlib | A flexible XML serialization library for .NET Framework and .NET Core. | 0 |
| feichao93/temme | A lightweight, CSS-based selector for extracting structured data from HTML documents. | 273 |
| fielddb/multilingualcorporaextractor | Extracts and formats multilingual corpora from international Bibles into XML, JSON, and HTML files for analysis. | 0 |
| tjatse/node-readability | Automates web page scraping and text extraction to make any webpage readable. | 343 |
| aymericbeaumet/squeeze | A tool to extract relevant information from text. | 17 |