WarcPartitioner

ARC file processor

Tool for partitioning and merging Web archive files by MIME type and year

Partition (W)ARC Files by MIME Type and Year

1 star
2 watching
1 fork
Language: Java
last commit: almost 8 years ago
Linked from 1 awesome list

hadoop, warc, web-archiving, webarchive

Related projects:

| Repository | Description | Stars |
|---|---|---|
| ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared URL index | 42 |
| helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 24 |
| arcalex/warcrefs | Tools to identify and convert duplicate records in archived web content | 6 |
| webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385 |
| richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation | 20 |
| n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 94 |
| chfoo/warcat | Tool for handling Web Archive files | 150 |
| ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
| internetarchive/warctools | Tools for working with archived web content | 152 |
| webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 46 |
| helgeho/archivespark | A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats | 145 |
| nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30 |
| peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities | 55 |
| lord/wargo | A tool for easy compilation and testing of Rust applications on WebAssembly platforms | 261 |
| archiveteam/grab-site | A web crawler designed to back up websites by recursively crawling and writing WARC files | 1,398 |