whirlwind-python
WARC tour
Tours using Common Crawl's WARC format data to demonstrate its structure and contents
A whirlwind tour of Common Crawl's data using Python
14 stars
9 watching
5 forks
Language: Python
last commit: 2 months ago
Tags: archive, python, tutorial, warc
Related projects:
Repository | Description | Stars |
---|---|---|
jakevdp/whirlwindtourofpython | An introduction to Python programming and data science | 3,743 |
commoncrawl/cc-notebooks | Analyzing and exploring Common Crawl data using Jupyter notebooks to provide insights into web archiving and internet connections. | 48 |
internetarchive/warctools | Tools for working with archived web content | 153 |
chfoo/warcat | Tool for handling Web Archive files | 152 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,406 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391 |
woocommerce/wc-api-python | A Python wrapper for interacting with WooCommerce's REST API. | 216 |
ripe-ncc/ripe-atlas-cousteau | A Python library that provides access to the RIPE ATLAS API. | 65 |
cahirwpz/demoscene | A collection of Amiga OCS demoscene-related sources and tools | 116 |
unt-libraries/py-wasapi-client | Downloads WARC files from a WASAPI access point. | 15 |
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format. | 397 |
wntrblm/nox | Automates testing in multiple Python environments. | 1,344 |
swaroopch/byte-of-python | A beginner's guide to the Python programming language | 2,322 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59 |