python-goose

Article Scraper

An HTML content extractor and web scraper that retrieves article metadata and images from web pages

HTML content / article extractor, web scraping library in Python

GitHub

4k stars
202 watching
787 forks
Language: HTML
last commit: almost 3 years ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
goose3/goose3 | An article extraction tool that retrieves metadata and main content from web articles | 837
laramies/metagoofil | Extracts metadata from publicly available documents on websites | 1,037
foolin/pagser | A tool for automatically extracting structured data from HTML pages | 105
j6k4m8/goosepaper | A utility that generates and delivers a daily newspaper to an e-ink tablet based on RSS feeds, news articles, and weather data | 271
unclecode/crawl4ai | A web crawling framework designed to efficiently extract structured data from the web, optimized for artificial intelligence applications | 17,640
getpelican/pelican | A tool for creating and publishing static websites using Markdown and reStructuredText syntax in Python | 12,617
jsvine/pdfplumber | A tool for extracting detailed information from PDFs | 6,821
gocolly/colly | A framework for extracting structured data from websites in a fast and elegant way | 23,387
needmorecowbell/giggity | A tool to scrape and store hierarchical data about GitHub organizations, users, or repositories | 126
apify/crawlee | A tool for building reliable web scraping and browser automation pipelines in Node.js | 15,845
geeks-of-data/knowledge-gpt | Extracts and stores information from various sources using AI models to generate answers | 282
xyntopia/pydoxtools | A Python library for extracting information from unstructured documents using AI techniques and customizable pipelines | 78
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites | 2,035
gee-community/geetools | Tools for processing geospatial data using the Google Earth Engine Python API | 529
armbues/ioc_parser | Extracts indicators of compromise from PDF security reports | 429