whirlwind-python

WARC tour

A tour of Common Crawl's WARC-format data that demonstrates its structure and contents

A whirlwind tour of Common Crawl's data using Python
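
To give a sense of the WARC structure the tour refers to, here is a minimal sketch that uses only the Python standard library. It is not code from this repository. The helpers `build_record` and `iter_records` are hypothetical names introduced for illustration. The sketch assumes uncompressed WARC/1.0 input and omits fields a real record carries (such as `WARC-Record-ID` and `WARC-Date`). Common Crawl's actual files are gzip-compressed, so a dedicated library such as warcio is the more realistic choice in practice.

```python
import io


def build_record(warc_type, target_uri, payload):
    """Serialize one simplified, uncompressed WARC/1.0 record (illustrative helper)."""
    header_lines = [
        "WARC/1.0",
        f"WARC-Type: {warc_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (version, headers, payload) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield version, headers, payload


if __name__ == "__main__":
    data = build_record(
        "response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello"
    )
    for version, headers, payload in iter_records(io.BytesIO(data)):
        print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Reading the content block by its declared `Content-Length`, rather than scanning for a delimiter, is what lets a payload contain blank lines of its own, as the HTTP response in the example does.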

GitHub

12 stars
9 watching
2 forks
Language: Python
last commit: 11 days ago
archive python tutorial warc

Related projects:

Repository | Description | Stars
jakevdp/whirlwindtourofpython | An introduction to Python programming and data science | 3,732
commoncrawl/cc-notebooks | Analyzing and exploring Common Crawl data using Jupyter notebooks to provide insights into web archiving and internet connections. | 46
internetarchive/warctools | Tools for working with archived web content | 152
chfoo/warcat | Tool for handling Web Archive files | 150
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30
archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files. | 1,402
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385
woocommerce/wc-api-python | A Python wrapper for interacting with WooCommerce's REST API. | 213
ripe-ncc/ripe-atlas-cousteau | A Python library that provides access to the RIPE ATLAS API. | 65
cahirwpz/demoscene | A collection of Amiga OCS demoscene-related sources and tools | 115
unt-libraries/py-wasapi-client | Downloads WARC files from a WASAPI access point. | 14
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format. | 394
wntrblm/nox | Automates testing in multiple Python environments. | 1,333
swaroopch/byte-of-python | A beginner's guide to the Python programming language | 2,316
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 57