whirlwind-python
WARC tour
A tour of Common Crawl's WARC-format data that demonstrates its structure and contents
A whirlwind tour of Common Crawl's data using Python
12 stars
9 watching
2 forks
Language: Python
Last commit: 11 days ago
Topics: archive, python, tutorial, warc
Related projects:
Repository | Description | Stars |
---|---|---|
jakevdp/whirlwindtourofpython | An introduction to Python programming and data science | 3,732 |
commoncrawl/cc-notebooks | Analyzing and exploring Common Crawl data using Jupyter notebooks to provide insights into web archiving and internet connections. | 46 |
internetarchive/warctools | Tools for working with archived web content | 152 |
chfoo/warcat | Tool for handling Web Archive files | 150 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30 |
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,402 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385 |
woocommerce/wc-api-python | A Python wrapper for interacting with WooCommerce's REST API. | 213 |
ripe-ncc/ripe-atlas-cousteau | A Python library that provides access to the RIPE ATLAS API. | 65 |
cahirwpz/demoscene | A collection of Amiga OCS demoscene-related sources and tools | 115 |
unt-libraries/py-wasapi-client | Downloads WARC files from a WASAPI access point. | 14 |
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format. | 394 |
wntrblm/nox | Automates testing in multiple Python environments. | 1,333 |
swaroopch/byte-of-python | A beginner's guide to the Python programming language | 2,316 |
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 57 |