python-goose

Article Scraper

An HTML content extractor and web scraper for extracting article metadata and images from web pages

HTML content / article extractor, web scraping library in Python
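As a rough illustration of what "extracting article metadata and images" can involve, the sketch below pulls a title, description, and image URLs out of an HTML string using only the Python standard library. It is not python-goose's implementation or API. The names `ArticleMetaParser`, `extract_metadata`, and `SAMPLE_HTML` are hypothetical, and preferring the `og:image` meta tag as the top image is an assumption made for the example.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collects the <title> text, <meta> name/property values, and <img> sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use "name"; Open Graph tags use "property".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict of basic article metadata found in an HTML string."""
    parser = ArticleMetaParser()
    parser.feed(html)
    parser.close()
    first_image = parser.images[0] if parser.images else None
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        # Assumption: prefer og:image, fall back to the first <img> on the page.
        "top_image": parser.meta.get("og:image") or first_image,
        "images": parser.images,
    }


# Hypothetical sample page used only for this example.
SAMPLE_HTML = """
<html><head>
<title>Example Article</title>
<meta name="description" content="A short summary.">
<meta property="og:image" content="https://example.com/cover.jpg">
</head><body>
<img src="/figures/one.png">
<p>Body text.</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

A real extractor also has to locate the main article body and score candidate images, which this sketch does not attempt.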

GitHub

4k stars
202 watching
786 forks
Language: HTML
last commit: about 3 years ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
goose3/goose3 | An article extraction tool that retrieves metadata and main content from web articles | 840
laramies/metagoofil | Extracts metadata from public documents found on websites, useful for brute-force attacks | 1,050
foolin/pagser | A tool for automatically extracting structured data from HTML pages | 105
j6k4m8/goosepaper | A utility that generates and delivers a daily newspaper to an e-ink tablet based on RSS feeds, news articles, and weather data | 274
unclecode/crawl4ai | A web crawling tool designed to extract structured data from the web for use in AI applications | 18,541
getpelican/pelican | A tool for creating and publishing static websites using Markdown and reStructuredText syntax in Python | 12,636
jsvine/pdfplumber | A tool for extracting detailed information from PDFs | 6,898
gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way | 23,444
needmorecowbell/giggity | A tool to scrape and store hierarchical data about GitHub organizations, users, or repositories | 127
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js | 16,081
geeks-of-data/knowledge-gpt | Extracts and stores information from various sources using AI models to generate answers | 283
xyntopia/pydoxtools | A Python library for extracting information from unstructured documents using AI techniques and customizable pipelines | 78
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites | 2,037
gee-community/geetools | A collection of tools and extensions to the Google Earth Engine Python API for geospatial processing | 531
armbues/ioc_parser | Extracts indicators of compromise from PDF security reports | 430