python-goose
Article Scraper
An HTML content extractor and web scraper for extracting article metadata and images from web pages
HTML content and article extractor, web scraping library in Python
4k stars
202 watching
787 forks
Language: HTML
Last commit: almost 3 years ago
Linked from 1 awesome list
Related projects:
Repository | Description | Stars |
---|---|---|
goose3/goose3 | An article extraction tool that retrieves metadata and main content from web articles | 837 |
laramies/metagoofil | Extracts metadata from publicly available documents on websites | 1,037 |
foolin/pagser | A tool for automatically extracting structured data from HTML pages | 105 |
j6k4m8/goosepaper | A utility that generates and delivers a daily newspaper to an e-ink tablet based on RSS feeds, news articles, and weather data. | 271 |
unclecode/crawl4ai | A web crawling framework designed to efficiently extract structured data from the web, optimized for artificial intelligence applications. | 17,640 |
getpelican/pelican | A tool for creating and publishing static websites using Markdown and reStructuredText syntax in Python. | 12,617 |
jsvine/pdfplumber | A tool for extracting detailed information from PDFs | 6,821 |
gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way | 23,387 |
needmorecowbell/giggity | A tool to scrape and store hierarchical data about GitHub organizations, users, or repositories. | 126 |
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js. | 15,845 |
geeks-of-data/knowledge-gpt | Extracts and stores information from various sources using AI models to generate answers. | 282 |
xyntopia/pydoxtools | A Python library for extracting information from unstructured documents using AI techniques and customizable pipelines. | 78 |
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,035 |
gee-community/geetools | Tools for processing geospatial data using the Google Earth Engine Python API | 529 |
armbues/ioc_parser | Extracts indicators of compromise from PDF security reports | 429 |