awesome-web-archiving
web archives
A curated collection of resources and tools for web archiving
An Awesome List for getting started with web archiving
2k stars
90 watching
156 forks
last commit: 15 days ago
Linked from 4 awesome lists
Awesome Web Archiving / Training/Documentation / Introductions to web archiving concepts: | |||
What is a web archive? | A video from | ||
Wikipedia's List of Web Archiving Initiatives | |||
Glossary of Archive-It and Web Archiving Terms | |||
The Web Archiving Lifecycle Model | The Web Archiving Lifecycle Model is an attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world | ||
Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray | |||
Awesome Web Archiving / Training/Documentation / Training materials: | |||
IIPC and DPC Training materials: module for beginners (8 sessions) | |||
UNT Web Archiving Course | 20 | 9 months ago | |
Continuing Education to Advance Web Archiving (CEDWARC) | |||
A Whirlwind Tour of Common Crawl's Datasets using Python | 12 | 9 days ago | |
Awesome Web Archiving / Training/Documentation / The WARC Standard: | |||
warc-specifications | The community HTML version of the official specification and hub for new proposals | ||
official ISO 28500 WARC specification homepage | |||
Awesome Web Archiving / Training/Documentation / For researchers using web archives: | |||
GLAM Workbench: Web Archives | See also | ||
Archives Unleashed Toolkit documentation | |||
Tutorial for Humanities researchers about how to explore Arquivo.pt | |||
Awesome Web Archiving / Resources for Web Publishers | |||
Stanford Libraries' Archivability pages | |||
Archive Ready | A tool for estimating how likely a web page is to be archived successfully | ||
Awesome Web Archiving / Tools & Software | |||
Comparison of web archiving software | 92 | about 6 years ago | |
Awesome Website Change Monitoring | 495 | over 2 years ago | |
Awesome Web Archiving / Tools & Software / Acquisition | |||
ArchiveBox | 22,341 | 5 days ago | A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly ) |
archivenow | 410 | 10 months ago | A tool to push web resources into on-demand web archives |
ArchiveWeb.Page | A plugin for Chrome and other Chromium based browsers that lets you interactively archive web pages, replay them, and export them as WARC & WACZ files. Also available as an Electron based desktop application | ||
Auto Archiver | 570 | about 2 months ago | Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the |
Browsertrix Crawler | 652 | 7 days ago | A Chromium based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container |
Brozzler | 671 | 9 days ago | A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links |
Cairn | 43 | 15 days ago | An npm package and CLI tool for saving webpages |
Chronicler | 84 | almost 6 years ago | Web browser with record and replay functionality |
crau | 57 | almost 2 years ago | crau is the way (most) Brazilians pronounce crawl; it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs |
Crawl | A simple web crawler in Golang | ||
crocoite | 42 | almost 5 years ago | Crawl websites using headless Google Chrome/Chromium and save resources, static DOM snapshot and page screenshots to WARC files |
DiskerNet | 3,784 | 19 days ago | A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse making it available for offline replay |
F(b)arc | 77 | almost 7 years ago | A commandline tool and Python library for archiving data from using the |
freeze-dry | 271 | about 2 years ago | JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions |
grab-site | 1,398 | 5 months ago | The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns |
Heritrix | 2,833 | 14 days ago | An open source, extensible, web-scale, archival quality web crawler |
Awesome Web Archiving / Tools & Software / Acquisition / Heritrix | |||
Heritrix Q&A | 2,833 | 14 days ago | A discussion forum for asking questions and getting answers about using Heritrix |
Heritrix Walkthrough | 9 | over 8 years ago | |
Awesome Web Archiving / Tools & Software / Acquisition | |||
html2warc | 18 | over 1 year ago | A simple script to convert offline data into a single WARC file |
HTTrack | An open source website copying utility | ||
monolith | 11,218 | about 2 months ago | CLI tool to save a web page as a single HTML file |
Obelisk | 263 | 20 days ago | Go package and CLI tool for saving a web page as a single HTML file |
Scoop | 117 | 9 days ago | High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web |
SingleFile | 15,680 | 8 days ago | Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file |
SiteStory | A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server | ||
Social Feed Manager | Open source software that enables users to create social media collections from Twitter, Tumblr, Flickr, and Sina Weibo public APIs | ||
Squidwarc | 169 | over 4 years ago | An archival crawler that uses Chrome or Chrome Headless directly |
StormCrawler | A collection of resources for building low-latency, scalable web crawlers on Apache Storm | ||
twarc | 1,370 | about 1 year ago | A command line tool and Python library for archiving Twitter JSON data |
WAIL | 350 | about 2 months ago | A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages |
Warcprox | 381 | 15 days ago | WARC-writing MITM HTTP/S proxy |
WARCreate | An extension for archiving an individual webpage or website to a WARC file | ||
Warcworker | 55 | 5 months ago | An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI |
Wayback | 1,811 | 8 days ago | A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond |
Waybackpy | 479 | 9 months ago | Wayback Machine Save, CDX and availability API interface in Python and a command-line tool |
Web2Warc | 24 | about 7 years ago | An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX) |
Web Curator Tool | Open-source workflow management for selective web archiving | ||
WebMemex | Browser extension for Firefox and Chrome which lets you archive web pages you visit | ||
Wget | An open source file retrieval utility | ||
Wget-lua | 23 | almost 9 years ago | Wget with Lua extension |
Wpull | 556 | 7 months ago | A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler |
Awesome Web Archiving / Tools & Software / Replay | |||
InterPlanetary Wayback (ipwb) | 617 | 7 days ago | Web Archive (WARC) indexing and replay using |
OpenWayback | 486 | 11 months ago | The open source project aimed at developing Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser |
PYWB | 1,407 | 8 days ago | A Python 3 implementation of web archival replay tools, sometimes also known as 'Wayback Machine' |
Reconstructive | Reconstructive is a ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript) | ||
ReplayWeb.page | A browser-based, fully client-side replay engine for both local and remote WARC & WACZ files. Also available as an Electron based desktop application | ||
warc2html | 39 | 5 months ago | Converts WARC files to static HTML suitable for browsing offline or rehosting |
Awesome Web Archiving / Tools & Software / Search & Discovery | |||
Mink | 49 | about 1 month ago | An extension for querying Memento aggregators while browsing and integrating live-archived web navigation |
playback | 6 | 7 months ago | A toolkit for searching archived webpages from , , and beyond |
SecurityTrails | Web based archive for WHOIS and DNS records. REST API available free of charge | ||
Tempas v1 | Temporal web archive search based on tags | ||
Tempas v2 | Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., ) | ||
webarchive-discovery | 116 | 3 months ago | WARC and ARC full-text indexing and discovery tools, with a number of associated tools capable of using the index shown below |
Awesome Web Archiving / Tools & Software / Search & Discovery / webarchive-discovery | |||
Shine | 43 | over 4 years ago | A prototype web archives exploration UI, developed with researchers as part of the |
SolrWayback | 102 | 10 days ago | A Java backend and Vue.js frontend project with free-text search and a built-in playback engine. Requires WARC files that have been indexed with the Warc-Indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. contains all the software and dependencies in an out-of-the-box solution that is easy to install |
Warclight | 49 | over 1 year ago | A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats |
Wasp | 26 | about 2 years ago | A fully functional prototype of a personal |
here | 116 | 3 months ago | Other possible options for building a front-end are listed in the wiki |
Awesome Web Archiving / Tools & Software / Utilities | |||
ArchiveTools | 69 | over 2 years ago | Collection of tools to extract and interact with WARC files (Python) |
cdx-toolkit | Library and CLI to consult cdx indexes and create WARC extractions of subsets. Abstracts away Common Crawl's unusual crawl structure | ||
Go Get Crawl | 147 | 17 days ago | Extract web archive data using and |
gowarcserver | 14 | about 1 month ago | -based capture index (CDX) and WARC record server, used to index and serve WARC files (Go) |
har2warc | 46 | about 6 years ago | Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python) |
httpreserve.info | Service to return the status of a web page or save it to the Internet Archive. HTTPreserve includes disambiguation of well-known short link services. It returns JSON via the browser or command line via CURL using GET. Describes web sites using earliest and latest dates in the Internet Archive and demonstrates the construction of Robust Links in its output using that range. (Golang) | ||
HTTPreserve linkstat | 9 | 6 days ago | Command line implementation of to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like JQ. HTTPreserve Linkstat describes current status, and earliest and latest links on . (Golang) |
Internet Archive Library | 1,625 | 6 days ago | A command line tool and Python library for interacting directly with . (Python) |
httrack2warc | 30 | 4 months ago | Convert HTTrack archives to WARC format (Java) |
MementoMap | 10 | over 3 years ago | A Tool to Summarize Web Archive Holdings (Python) |
MemGator | 57 | 6 months ago | A Memento Aggregator CLI and Server (Golang) |
node-cdxj | 0 | over 7 years ago | file parser (Node.js) |
OutbackCDX | 32 | about 1 month ago | RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as backend for OpenWayback, PyWb and |
py-wasapi-client | 14 | about 5 years ago | Command line application to download crawls from WASAPI (Python) |
The Archive Browser | The Archive Browser is a program that lets you browse the contents of archives, as well as extract them. It will let you open files from inside archives, and lets you preview them using Quick Look. WARC is supported (macOS only, Proprietary app) | ||
The Unarchiver | Program to extract the contents of many archive formats, inclusive of WARC, to a file system. Free variant of The Archive Browser (macOS only, Proprietary app) | ||
tikalinkextract | 9 | 14 days ago | Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server) |
wasapi-downloader | 6 | 10 days ago | Java command line application to download crawls from WASAPI |
Warchaeology | Warchaeology is a collection of tools for inspecting, manipulating, deduplicating, and validating WARC files | ||
warcdb | 394 | 4 months ago | A command line utility (Python) for importing WARC files into a SQLite database |
warcdedupe | WARC deduplication tool (and WARC library) written in Rust. (In Development) | ||
warc-safe | 10 | 3 months ago | Automatic detection of viruses and NSFW content in WARC files |
WarcPartitioner | 1 | almost 8 years ago | Partition (W)ARC Files by MIME Type and Year |
warcrefs | 6 | about 6 years ago | Web archive deduplication tools |
webarchive-indexing | 42 | almost 7 years ago | Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system |
wikiteam | 729 | 3 months ago | Tools for downloading and preserving wikis |
Awesome Web Archiving / Tools & Software / WARC I/O Libraries | |||
FastWARC | 84 | 3 months ago | A high-performance WARC parsing library (Python) |
HadoopConcatGz | 9 | almost 7 years ago | A Splitable Hadoop InputFormat for Concatenated GZIP Files (and ) |
jwarc | 47 | 7 days ago | Read and write WARC files with a type safe API (Java) |
Jwat | 3 | 11 months ago | Libraries for reading/writing/validating WARC/ARC/GZIP files (Java) |
Jwat-Tools | 5 | 11 months ago | Tools for reading/writing/validating WARC/ARC/GZIP files (Java) |
node-warc | 94 | almost 2 years ago | Parse WARC files or create WARC files using either or (Node.js) |
Sparkling | 11 | about 1 month ago | Internet Archive's Sparkling Data Processing Library |
Unwarcit | 8 | almost 3 years ago | Command line interface to unzip WARC and WACZ files (Python) |
Warcat | 150 | about 1 month ago | Tool and library for handling Web ARChive (WARC) files (Python) |
warcio | 385 | 9 days ago | Streaming WARC/ARC library for fast web archive IO (Python) |
warctools | 152 | about 4 years ago | Library to work with ARC and WARC files (Python) |
webarchive | 20 | over 1 year ago | Golang readers for ARC and WARC webarchive formats (Golang) |
Awesome Web Archiving / Tools & Software / Analysis | |||
Archives Research Compute Hub | 15 | 3 months ago | Web application for distributed compute analysis of Archive-It web archive collections |
ArchiveSpark | 145 | 2 months ago | An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction as well as derivation |
Archives Unleashed Notebooks | 22 | almost 2 years ago | Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit |
Archives Unleashed Toolkit | 137 | 9 months ago | Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark |
Common Crawl Columnar Index | SQL-queryable index, with CDX info plus language classification | ||
Common Crawl Web Graph | A host or domain-level graph of the web, with ranking information | ||
Common Crawl Jupyter notebooks | 46 | over 2 years ago | A collection of notebooks using Common Crawl's various datasets |
Tweet Archives Unleashed Toolkit | 9 | over 1 year ago | An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark |
Web Data Commons | Structured data extracted from Common Crawl | ||
Awesome Web Archiving / Tools & Software / Quality Assurance | |||
Chrome Check My Links | Browser extension: a link checker with more options | ||
Chrome link checker | Browser extension: basic link checker | ||
Chrome link gopher | Browser extension: link harvester on a page | ||
Chrome Open Multiple URLs | Browser extension: opens multiple URLs and also extracts URLs from text | ||
Chrome Revolver | Browser extension: switches between browser tabs | ||
FlameShot | 25,024 | 11 days ago | Screen capture and annotation on Ubuntu |
PlayOnLinux | For running Xenu and Notepad++ on Ubuntu | ||
PlayOnMac | For running Xenu and Notepad++ on macOS | ||
Windows Snipping Tool | Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut for taking partial screen capture) | ||
WineBottler | For running Xenu and Notepad++ on macOS | ||
xDoTool | 3,264 | about 1 month ago | Click automation on Ubuntu |
Xenu | Desktop link checker for Windows | ||
Awesome Web Archiving / Tools & Software / Curation | |||
Zotero Robust Links Extension | An extension that submits to and reads from web archives. Source . Supersedes | ||
Awesome Web Archiving / Community Resources / Other Awesome Lists | |||
Web Archiving Community | 22,341 | 5 days ago | |
Awesome Memento | 88 | 6 months ago | |
The WARC Ecosystem | |||
The Web Crawl section of COPTR | |||
Awesome Web Archiving / Community Resources / Blogs and Scholarship | |||
IIPC Blog | |||
Web Archiving Roundtable | Unofficial blog of the Web Archiving Roundtable, maintained by the members of the Roundtable | ||
The Web as History | An open-source book that provides a conceptual overview of web archiving research, as well as several case studies | ||
WS-DL Blog | Web Science and Digital Libraries Research Group blogs about various Web archiving related topics, scholarly work, and academic trip reports | ||
DSHR's Blog | David Rosenthal regularly reviews and summarizes work done in the Digital Preservation field | ||
UK Web Archive Blog | |||
Common Crawl Foundation Blog | |||
Awesome Web Archiving / Community Resources / Mailing Lists | |||
Common Crawl | |||
IIPC | |||
OpenWayback | |||
WASAPI | |||
Awesome Web Archiving / Community Resources / Slack | |||
IIPC Slack | Ask for access | ||
Archives Unleashed Slack | for access to a researcher group of people working with web archives | ||
Archivers Slack | to a multi-disciplinary effort for archiving projects run in affiliation with and | ||
Common Crawl Foundation Partners | (ask greg zat commoncrawl zot org for an invite) | ||
Awesome Web Archiving / Community Resources / Twitter | |||
@NetPreserve | Official IIPC handle | ||
@WebSciDL | ODU Web Science and Digital Libraries Research Group | ||
#WebArchiving | |||
#WebArchiveWednesday | |||
Awesome Web Archiving / Web Archiving Service Providers / Self-hostable, Open Source | |||
Browsertrix | From , source available at | ||
Conifer | From , source available at | ||
Awesome Web Archiving / Web Archiving Service Providers / Hosted, Closed Source | |||
Archive-It | From the Internet Archive | ||
Arkiwera | |||
Hanzo | |||
MirrorWeb | |||
PageFreezer | |||
Smarsh |