WarcPartitioner
ARC file processor
Tool for partitioning and merging Web archive files by MIME type and year
Partition (W)ARC Files by MIME Type and Year
1 star
2 watching
1 fork
Language: Java
Last commit: almost 8 years ago
Linked from 1 awesome list
hadoop, warc, web-archiving, webarchive
Related projects:
Repository | Description | Stars |
---|---|---|
ikreymer/webarchive-indexing | Tools for bulk indexing of WARC/ARC files to create a shared url index | 42 |
helgeho/web2warc | A Web crawler that creates custom archives in WARC/CDX format | 24 |
arcalex/warcrefs | Tools to identify and convert duplicate records in archived web content | 6 |
webrecorder/warcio | A fast streaming library for working with WARC format web archival data | 385 |
richardlehane/webarchive | Provides tools for reading and parsing web archive formats used in digital preservation. | 20 |
n0tan3rd/node-warc | A tool for parsing and generating Web Archive files in JavaScript using Node.js | 94 |
chfoo/warcat | Tool for handling Web Archive files | 150 |
ukwa/webarchive-discovery | Tools for indexing and discovering archived web content | 116 |
internetarchive/warctools | Tools for working with archived web content | 152 |
webrecorder/har2warc | Converts HTTP Archive format to Web Archive format | 46 |
helgeho/archivespark | A framework for efficient data processing and extraction from archival collections, enabling the transformation of raw data into more accessible formats. | 145 |
nla/httrack2warc | Converts HTTrack crawls to WARC files by reconstructing requests and responses from logs | 30 |
peterk/warcworker | A web archiving tool that archives websites with high-fidelity preservation capabilities. | 55 |
lord/wargo | A tool for easy compilation and testing of Rust applications on WebAssembly platforms. | 261 |
archiveteam/grab-site | A web crawler designed to backup websites by recursively crawling and writing WARC files. | 1,398 |