html2warc

Data converter

Converts offline data into a standard archival format

A simple script that converts web resources into a single WARC file.
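To illustrate what packing several web resources into one WARC file involves, here is a minimal standard-library sketch. It is not html2warc's actual code: the helper names `build_resource_record` and `write_warc` are hypothetical, and it assumes WARC/1.0 `resource` records, which store the payload directly without HTTP headers.

```python
import io
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, named fields, then a blank line before the record block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLF pairs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def write_warc(resources, out) -> int:
    """Append one record per (uri, payload, content_type) tuple; return the count."""
    count = 0
    for uri, payload, content_type in resources:
        out.write(build_resource_record(uri, payload, content_type))
        count += 1
    return count
```

Calling `write_warc([("http://example.com/", b"<html>hi</html>", "text/html")], open("site.warc", "wb"))` would produce a single-record archive. For real use, a maintained library such as webrecorder/warcio is likely a better choice than hand-rolled serialization.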

GitHub

18 stars
4 watching
2 forks
Language: Python
Last commit: over 1 year ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 48
iipc/warc2html | Converts WARC files to static HTML with relative link rewriting and renaming | 41
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 32
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 391
internetarchive/warctools | Tools for working with archived web content | 153
chfoo/warcat | Tool for handling Web Archive files | 152
alir3z4/html2text | Converts HTML to plain text that can be easily read and formatted as Markdown | 1,862
deedy5/html2text_rs | Converts HTML to different formats | 4
arcalex/warcrefs | Tools to identify and convert duplicate records in archived web content | 6
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 95
samboy/woff | Converts TrueType font files to compressed Webfont formats for web use | 25
turicas/crau | A command-line tool for archiving and playing back websites in WARC format | 59
florents-tselai/warcdb | A library for storing and querying web crawl data in a compact, easily sharable format | 397
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20
iipc/jwarc | A Java library for reading and writing WARC files with a typed API | 48