whirlwind-python

WARC tour

A tour of Common Crawl's WARC-format data that demonstrates its structure and contents

A whirlwind tour of Common Crawl's data using Python
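
To give a sense of the "structure and contents" a WARC tour covers, here is a minimal sketch that parses one hand-written, uncompressed WARC record using only the standard library. The sample record (URI, date, payload) and the helper name `read_records` are made up for illustration and are not taken from the repository. Common Crawl's real files are gzip-compressed, so in practice a dedicated library such as warcio is likely the better choice.

```python
import io

# Hypothetical sample: one uncompressed WARC "response" record.
# The URI, date, and payload are invented for illustration.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2024-01-01T00:00:00Z\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)


def read_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed WARC stream.

    Assumes well-formed input; this is a sketch, not a validating parser.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records

        # Named header fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        block = stream.read(int(headers["Content-Length"]))
        yield version.decode("ascii").strip(), headers, block


for version, headers, block in read_records(io.BytesIO(SAMPLE)):
    print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

Each record is a version line, a set of named header fields, an empty line, and then a content block whose size is given by `Content-Length`. For a `response` record the block holds the captured HTTP response, including its own headers.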

GitHub

14 stars
9 watching
5 forks
Language: Python
last commit: 2 months ago
archive, python, tutorial, warc

Related projects:

Repository | Description | Stars
jakevdp/whirlwindtourofpython | An introduction to Python programming and data science | 3,743
commoncrawl/cc-notebooks | Analyzing and exploring Common Crawl data using Jupyter notebooks to provide insights into web archiving and internet connections | 48
internetarchive/warctools | Tools for working with archived web content | 153
chfoo/warcat | Tool for handling Web Archive files | 152
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,406
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391
woocommerce/wc-api-python | A Python wrapper for interacting with WooCommerce's REST API | 216
ripe-ncc/ripe-atlas-cousteau | A Python library that provides access to the RIPE ATLAS API | 65
cahirwpz/demoscene | A collection of Amiga OCS demoscene-related sources and tools | 116
unt-libraries/py-wasapi-client | Downloads WARC files from a WASAPI access point | 15
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 397
wntrblm/nox | Automates testing in multiple Python environments | 1,344
swaroopch/byte-of-python | A beginner's guide to the Python programming language | 2,322
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59