gv-crawl

Text aligner

Automates text extraction and alignment from Global Voices articles to create parallel corpora for low-resource languages.

Global Voices bitext crawler

9 stars
1 watching
4 forks
Language: Python
Last commit: about 10 years ago
Linked from 1 awesome list


Related projects:

Repository | Description | Stars
gregorut/vgchartzscrape | A Python script that captures data from vgchartz.com and saves it to a CSV file. | 79
elliotgao2/gain | A Python web crawling framework utilizing asyncio and aiohttp for efficient data extraction from websites. | 2,035
chenjiandongx/github-spider | A Python-based web crawler for scraping GitHub user and repository data. | 264
jmg/crawley | A Pythonic framework for building high-speed web crawlers with flexible data extraction and storage options. | 186
cocrawler/cocrawler | A versatile web crawler built with modern tools and concurrency to handle various crawl tasks. | 187
puerkitobio/gocrawl | A concurrent web crawler written in Go that allows flexible and polite crawling of websites. | 2,038
x-plug/cvalues | Evaluates and aligns the values of Chinese large language models with safety and responsibility standards. | 477
0xvavaldi/gramify | Analyzes text data to extract patterns of words or characters for password cracking and analysis purposes. | 28
vchitect/vbench | A tool for evaluating and benchmarking video generative models in computer vision and artificial intelligence. | 576
kahunalu/pwnbin | Searches public pastebins for specified keywords and returns matching results. | 427
machinalis/yalign | Automates the process of extracting parallel sentences from comparable corpora to aid in statistical machine translation. | 127
a11ywatch/crawler | Performs web page crawling at high performance. | 49
vida-nyu/ache | A web crawler designed to efficiently collect and prioritize relevant content from the web. | 454
gentlegiantjgc/pymctranslate | Enables data translation between Minecraft versions and platforms via an intermediate format. | 27
jwvhewitt/dmeternal | A dungeon crawler game written in Python, featuring procedurally generated content and turn-based gameplay. | 57