gv-crawl

Text aligner

Automates text extraction and alignment from Global Voices articles to create parallel corpora for low-resource languages.

Global Voices bitext crawler

9 stars
1 watching
4 forks
Language: Python
Last commit: over 10 years ago
Linked from 1 awesome list


Related projects:

| Repository | Description | Stars |
|---|---|---|
| gregorut/vgchartzscrape | A Python script that captures data from vgchartz.com and saves it to a CSV file | 80 |
| elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,037 |
| chenjiandongx/github-spider | A Python-based web crawler for scraping GitHub user and repository data. | 264 |
| jmg/crawley | A Pythonic framework for building high-speed web crawlers with flexible data extraction and storage options. | 188 |
| cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks | 188 |
| puerkitobio/gocrawl | A concurrent web crawler written in Go that allows flexible and polite crawling of websites. | 2,036 |
| x-plug/cvalues | Evaluates and aligns the values of Chinese large language models with safety and responsibility standards | 481 |
| 0xvavaldi/gramify | Analyzes text data to extract patterns of words or characters for password cracking and analysis purposes. | 28 |
| vchitect/vbench | A benchmark suite for evaluating the performance of video generative models | 643 |
| kahunalu/pwnbin | Searches public pastebins for specified keywords and returns matching results | 428 |
| machinalis/yalign | Automates the process of extracting parallel sentences from comparable corpora to aid in statistical machine translation | 127 |
| a11ywatch/crawler | Performs web page crawling at high performance. | 51 |
| vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web | 459 |
| gentlegiantjgc/pymctranslate | Enables data translation between Minecraft versions and platforms via an intermediate format. | 27 |
| jwvhewitt/dmeternal | A dungeon crawler game written in Python, featuring procedurally generated content and turn-based gameplay. | 57 |