awesome-web-archiving

web archives

A curated collection of resources and tools for web archiving

An Awesome List for getting started with web archiving

GitHub

2k stars

90 watching

156 forks

last commit: over 1 year ago

Linked from 4 awesome lists

awesome, awesome-list, webarchiving

Awesome Web Archiving / Training/Documentation / Introductions to web archiving concepts:
What is a web archive?			A video from
Wikipedia's List of Web Archiving Initiatives
Glossary of Archive-It and Web Archiving Terms
The Web Archiving Lifecycle Model			The Web Archiving Lifecycle Model is an attempt to incorporate the technological and programmatic arms of the web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world
Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray
Awesome Web Archiving / Training/Documentation / Training materials:
IIPC and DPC Training materials: module for beginners (8 sessions)
UNT Web Archiving Course	20	over 2 years ago
Continuing Education to Advance Web Archiving (CEDWARC)
A Whirlwind Tour of Common Crawl's Datasets using Python	14	over 1 year ago
Awesome Web Archiving / Training/Documentation / The WARC Standard:
warc-specifications			The community HTML version of the official specification and hub for new proposals
Official ISO 28500 WARC specification homepage
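To make the record layout described by the specifications above more concrete, here is a minimal sketch of writing and reading a single WARC record using only the Python standard library. It assumes the basic layout of a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The helper names `build_warc_record` and `parse_warc_record` are hypothetical, and the `text/plain` content type is an arbitrary choice for illustration. For real work, one of the libraries listed under WARC I/O Libraries is a better fit.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Serialize one 'resource' record: version line, headers, blank line, block."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),  # arbitrary choice for this sketch
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record is terminated by two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def parse_warc_record(raw: bytes) -> tuple:
    """Split one serialized record into (header dict, content block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length says how many bytes of the remainder belong to the block.
    length = int(fields["Content-Length"])
    return fields, rest[:length]


record = build_warc_record("http://example.com/", b"hello")
fields, body = parse_warc_record(record)
print(fields["WARC-Type"], body)
```

Relying on `Content-Length` rather than searching for the record terminator matters because the content block may itself contain blank lines.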
Awesome Web Archiving / Training/Documentation / For researchers using web archives:
GLAM Workbench: Web Archives			See also
Archives Unleashed Toolkit documentation
Tutorial for Humanities researchers about how to explore Arquivo.pt
Awesome Web Archiving / Resources for Web Publishers
Stanford Libraries' Archivability pages
Archive Ready			A tool for estimating how likely a web page is to be archived successfully
Awesome Web Archiving / Tools & Software
Comparison of web archiving software	93	almost 8 years ago
Awesome Website Change Monitoring	497	almost 4 years ago
Awesome Web Archiving / Tools & Software / Acquisition
ArchiveBox	22,669	over 1 year ago	A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly )
archivenow	409	over 2 years ago	A to push web resources into on-demand web archives
ArchiveWeb.Page			A plugin for Chrome and other Chromium based browsers that lets you interactively archive web pages, replay them, and export them as WARC & WACZ files. Also available as an Electron based desktop application
Auto Archiver	585	almost 2 years ago	Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the
Browsertrix Crawler	677	over 1 year ago	A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container
Brozzler	678	over 1 year ago	A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links
Cairn	45	over 1 year ago	An npm package and CLI tool for saving webpages
Chronicler	85	over 7 years ago	Web browser with record and replay functionality
crau	59	over 3 years ago	crau is the way (most) Brazilians pronounce crawl, it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs
Crawl			A simple web crawler in Golang
crocoite	43	over 6 years ago	Crawl websites using headless Google Chrome/Chromium and save resources, static DOM snapshot and page screenshots to WARC files
DiskerNet	3,797	over 1 year ago	A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse making it available for offline replay
F(b)arc	77	over 8 years ago	A commandline tool and Python library for archiving data from using the
freeze-dry	272	almost 4 years ago	JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions
grab-site	1,406	about 2 years ago	The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Heritrix	2,857	over 1 year ago	An open source, extensible, web-scale, archival quality web crawler
Awesome Web Archiving / Tools & Software / Acquisition / Heritrix
Heritrix Q&A	2,857	over 1 year ago	A discussion forum for asking questions and getting answers about using Heritrix
Heritrix Walkthrough	9	about 10 years ago
Awesome Web Archiving / Tools & Software / Acquisition
html2warc	18	about 3 years ago	A simple script to convert offline data into a single WARC file
HTTrack			An open source website copying utility
monolith	11,339	over 1 year ago	CLI tool to save a web page as a single HTML file
Obelisk	267	over 1 year ago	Go package and CLI tool for saving a web page as a single HTML file
Scoop	123	over 1 year ago	High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web
SingleFile	15,984	over 1 year ago	Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file
SiteStory			A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server
Social Feed Manager			Open source software that enables users to create social media collections from Twitter, Tumblr, Flickr, and Sina Weibo public APIs
Squidwarc	170	about 6 years ago	An archival crawler that uses Chrome or Chrome Headless directly
StormCrawler			A collection of resources for building low-latency, scalable web crawlers on Apache Storm
twarc	1,373	over 2 years ago	A command line tool and Python library for archiving Twitter JSON data
WAIL	353	almost 2 years ago	A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; ,
Warcprox	389	over 1 year ago	WARC-writing MITM HTTP/S proxy
WARCreate			An extension for archiving an individual webpage or website to a WARC file
Warcworker	57	about 2 years ago	An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI
Wayback	1,839	over 1 year ago	A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond
Waybackpy	489	over 2 years ago	Wayback Machine Save, CDX and availability API interface in Python and a command-line tool
Web2Warc	25	almost 9 years ago	An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX)
Web Curator Tool			Open-source workflow management for selective web archiving
WebMemex			Browser extension for Firefox and Chrome which lets you archive web pages you visit
Wget			An open source file retrieval utility that of
Wget-lua	23	over 10 years ago	Wget with Lua extension
Wpull	556	about 2 years ago	A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler
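Several of the tools above wrap the Wayback Machine's public availability endpoint. As a rough sketch of what such a lookup involves, the following builds a query URL and pulls the closest snapshot out of a response, without making any network request. The helper names `availability_url` and `closest_snapshot` are hypothetical, and `SAMPLE` is a hand-written response in the shape the endpoint is understood to return, not real captured output.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(url: str, timestamp: str = "") -> str:
    """Build a query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a JSON response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hand-written illustrative response (an assumption, not captured output).
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
})

print(availability_url("example.com", "20200101"))
print(closest_snapshot(SAMPLE))
```

To run a live lookup, the URL from `availability_url` could be fetched with `urllib.request.urlopen` and the body passed to `closest_snapshot`.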
Awesome Web Archiving / Tools & Software / Replay
InterPlanetary Wayback (ipwb)	617	over 1 year ago	Web Archive (WARC) indexing and replay using
OpenWayback	487	over 2 years ago	An open source project that aims to develop the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser
PYWB	1,418	over 1 year ago	A Python 3 implementation of web archival replay tools, sometimes also known as 'Wayback Machine'
Reconstructive			Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript)
ReplayWeb.page			A browser-based, fully client-side replay engine for both local and remote WARC & WACZ files. Also available as an Electron based desktop application
warc2html	41	about 2 years ago	Converts WARC files to static HTML suitable for browsing offline or rehosting
Awesome Web Archiving / Tools & Software / Search & Discovery
Mink	50	almost 2 years ago	An extension for querying Memento aggregators while browsing and integrating live-archived web navigation
playback	8	over 2 years ago	A toolkit for searching archived webpages from , , and beyond
SecurityTrails			Web based archive for WHOIS and DNS records. REST API available free of charge
Tempas v1			Temporal web archive search based on tags
Tempas v2			Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., )
webarchive-discovery	117	almost 2 years ago	WARC and ARC full-text indexing and discovery tools, with a number of associated tools capable of using the index shown below
Awesome Web Archiving / Tools & Software / Search & Discovery / webarchive-discovery
Shine	43	about 6 years ago	A prototype web archives exploration UI, developed with researchers as part of the
SolrWayback	102	over 1 year ago	A Java backend and Vue.js frontend project with free-text search and a built-in playback engine. Requires WARC files to have been indexed with the warc-indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. contains all the software and dependencies in an out-of-the-box solution that is easy to install
Warclight	49	about 3 years ago	A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats
Wasp	27	almost 4 years ago	A fully functional prototype of a personal
here	117	almost 2 years ago	Other possible options for building a front-end are listed in the wiki
Awesome Web Archiving / Tools & Software / Utilities
ArchiveTools	71	about 4 years ago	Collection of tools to extract and interact with WARC files (Python)
cdx-toolkit			Library and CLI to consult cdx indexes and create WARC extractions of subsets. Abstracts away Common Crawl's unusual crawl structure
Go Get Crawl	148	over 1 year ago	Extract web archive data using and
gowarcserver	15	almost 2 years ago	-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go)
har2warc	48	almost 8 years ago	Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python)
httpreserve.info			Service to return the status of a web page or save it to the Internet Archive. HTTPreserve includes disambiguation of well-known short link services. It returns JSON via the browser or command line via CURL using GET. Describes web sites using earliest and latest dates in the Internet Archive and demonstrates the construction of Robust Links in its output using that range. (Golang)
HTTPreserve linkstat	10	over 1 year ago	Command line implementation of to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like JQ. HTTPreserve Linkstat describes current status, and earliest and latest links on . (Golang)
Internet Archive Library	1,643	over 1 year ago	A command line tool and Python library for interacting directly with . (Python)
httrack2warc	32	almost 2 years ago	Convert HTTrack archives to WARC format (Java)
MementoMap	10	about 5 years ago	A Tool to Summarize Web Archive Holdings (Python)
MemGator	60	about 2 years ago	A Memento Aggregator CLI and Server (Golang)
node-cdxj	0	about 9 years ago	file parser (Node.js)
OutbackCDX	33	over 1 year ago	RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as backend for OpenWayback, PyWb and
py-wasapi-client	15	almost 7 years ago	Command line application to download crawls from WASAPI (Python)
The Archive Browser			The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app)
The Unarchiver			Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app)
tikalinkextract	10	over 1 year ago	Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server)
wasapi-downloader	6	over 1 year ago	Java command line application to download crawls from WASAPI
Warchaeology			Warchaeology is a collection of tools for inspecting, manipulating, deduplicating and validating WARC-files
warcdb	397	about 2 years ago	A command line utility (Python) for importing WARC files into a SQLite database
warcdedupe			WARC deduplication tool (and WARC library) written in Rust. (In Development)
warc-safe	11	almost 2 years ago	Automatic detection of viruses and NSFW content in WARC files
WarcPartitioner	1	over 9 years ago	Partition (W)ARC Files by MIME Type and Year
warcrefs	6	almost 8 years ago	Web archive deduplication tools
webarchive-indexing	43	over 8 years ago	Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system
wikiteam	732	almost 2 years ago	Tools for downloading and preserving wikis
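Many of the utilities above read or write capture indexes (CDX). As an illustration, this sketch parses one line of a CDXJ-style index. It assumes the common layout of a SURT-style URL key, a 14-digit timestamp, and a JSON object, separated by single spaces. The helper name `parse_cdxj_line` and the sample line are hypothetical, and real indexes may carry different JSON fields.

```python
import json
from datetime import datetime, timezone


def parse_cdxj_line(line: str) -> dict:
    """Parse '<urlkey> <YYYYMMDDhhmmss> <json>' into a single dict."""
    # Split on the first two spaces only; the JSON part may contain spaces.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["captured"] = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return record


# Hypothetical sample line for illustration.
line = (
    'com,example)/ 20200101123000 '
    '{"url": "http://example.com/", "status": "200", '
    '"filename": "example.warc.gz", "offset": "0"}'
)
entry = parse_cdxj_line(line)
print(entry["urlkey"], entry["captured"].isoformat(), entry["filename"])
```

Because the key comes first and lines are kept sorted, an index in this layout can be binary-searched by URL, which is what makes lookups against large collections practical.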
Awesome Web Archiving / Tools & Software / WARC I/O Libraries
FastWARC	89	over 1 year ago	A high-performance WARC parsing library (Python)
HadoopConcatGz	9	over 8 years ago	A Splittable Hadoop InputFormat for Concatenated GZIP Files (and )
jwarc	48	over 1 year ago	Read and write WARC files with a type safe API (Java)
Jwat	3	over 2 years ago	Libraries for reading/writing/validating WARC/ARC/GZIP files (Java)
Jwat-Tools	5	over 2 years ago	Tools for reading/writing/validating WARC/ARC/GZIP files (Java)
node-warc	95	over 3 years ago	Parse WARC files or create WARC files using either or (Node.js)
Sparkling	11	over 1 year ago	Internet Archive's Sparkling Data Processing Library
Unwarcit	10	over 4 years ago	Command line interface to unzip WARC and WACZ files (Python)
Warcat	152	almost 2 years ago	Tool and library for handling Web ARChive (WARC) files (Python)
warcio	391	over 1 year ago	Streaming WARC/ARC library for fast web archive IO (Python)
warctools	153	almost 6 years ago	Library to work with ARC and WARC files (Python)
webarchive	20	over 3 years ago	Golang readers for ARC and WARC webarchive formats (Golang)
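Compressed WARC files are commonly stored as a series of gzip members concatenated end to end, typically one member per record, which is what lets tools seek straight to a record by byte offset. The following sketch shows how such a stream can be split back into members using only the Python standard library. The helper name `iter_gzip_members` is hypothetical, and the two short byte strings stand in for real WARC records.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in a concatenated stream."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect gzip framing.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        member = decompressor.decompress(data)
        member += decompressor.flush()
        yield member
        # Whatever follows the end of this member is the start of the next one.
        data = decompressor.unused_data


# Two independently compressed members, concatenated (stand-ins for records).
stream = gzip.compress(b"record one") + gzip.compress(b"record two")

for member in iter_gzip_members(stream):
    print(member)

# gzip.decompress reads all members too, but returns them joined together.
print(gzip.decompress(stream))
```

Splitting by member preserves record boundaries, whereas decompressing the whole stream at once loses them.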
Awesome Web Archiving / Tools & Software / Analysis
Archives Research Compute Hub	15	almost 2 years ago	Web application for distributed compute analysis of Archive-It web archive collections
ArchiveSpark	145	almost 2 years ago	An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction as well as derivation
Archives Unleashed Notebooks	23	over 3 years ago	Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit
Archives Unleashed Toolkit	138	over 2 years ago	Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark
Common Crawl Columnar Index			SQL-queryable index, with CDX info plus language classification
Common Crawl Web Graph			A host or domain-level graph of the web, with ranking information
Common Crawl Jupyter notebooks	48	about 4 years ago	A collection of notebooks using Common Crawl's various datasets
Tweet Archives Unleashed Toolkit	9	over 1 year ago	An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark
Web Data Commons			Structured data extracted from Common Crawl
Awesome Web Archiving / Tools & Software / Quality Assurance
Chrome Check My Links			Browser extension: a link checker with more options
Chrome link checker			Browser extension: basic link checker
Chrome link gopher			Browser extension: link harvester on a page
Chrome Open Multiple URLs			Browser extension: opens multiple URLs and also extracts URLs from text
Chrome Revolver			Browser extension: switches between browser tabs
FlameShot	25,218	over 1 year ago	Screen capture and annotation on Ubuntu
PlayOnLinux			For running Xenu and Notepad++ on Ubuntu
PlayOnMac			For running Xenu and Notepad++ on macOS
Windows Snipping Tool			Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture)
WineBottler			For running Xenu and Notepad++ on macOS
xDoTool	3,306	almost 2 years ago	Click automation on Ubuntu
Xenu			Desktop link checker for Windows
Awesome Web Archiving / Tools & Software / Curation
Zotero Robust Links Extension			An extension that submits to and reads from web archives. Source . Supersedes
Awesome Web Archiving / Community Resources / Other Awesome Lists
Web Archiving Community	22,669	over 1 year ago
Awesome Memento	91	about 2 years ago
The WARC Ecosystem
The Web Crawl section of COPTR
Awesome Web Archiving / Community Resources / Blogs and Scholarship
IIPC Blog
Web Archiving Roundtable			Unofficial blog of the Web Archiving Roundtable of the maintained by the members of the Web Archiving Roundtable
The Web as History			An open-source book that provides a conceptual overview to web archiving research, as well as several case studies
WS-DL Blog			Web Science and Digital Libraries Research Group blogs about various Web archiving related topics, scholarly work, and academic trip reports
DSHR's Blog			David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field
UK Web Archive Blog
Common Crawl Foundation Blog
Awesome Web Archiving / Community Resources / Mailing Lists
Common Crawl
IIPC
OpenWayback
WASAPI
Awesome Web Archiving / Community Resources / Slack
IIPC Slack			Ask for access
Archives Unleashed Slack			for access to a researcher group of people working with web archives
Archivers Slack			to a multi-disciplinary effort for archiving projects run in affiliation with and
Common Crawl Foundation Partners			(ask greg zat commoncrawl zot org for an invite)
Awesome Web Archiving / Community Resources / Twitter
@NetPreserve			Official IIPC handle
@WebSciDL			ODU Web Science and Digital Libraries Research Group
#WebArchiving
#WebArchiveWednesday
Awesome Web Archiving / Web Archiving Service Providers / Self-hostable, Open Source
Browsertrix			From , source available at
Conifer			From , source available at
Awesome Web Archiving / Web Archiving Service Providers / Hosted, Closed Source
Archive-It			From the Internet Archive
Arkiwera
Hanzo
MirrorWeb
PageFreezer
Smarsh
