A family of language models trained on Eritrean languages. The goal is to have good base models that work well for Tigrinya, Tigre, and related languages out of the box.
Language Models, Tigrinya, Tigre
Tigrinya Abusive Language Detection: a multi-task benchmark with 13.7K annotated YouTube comments covering abusiveness, sentiment, and topic classification in both Ge'ez script and Latin transliterations.
Content Moderation, Multi-Task, Dataset
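Since the comments appear in both Ge'ez script and Latin transliteration, a preprocessing step may need to tell the two apart. The following is a minimal sketch of one way to do that using Unicode code point ranges; the function names and the 0.5 threshold are illustrative assumptions, not part of the benchmark's tooling.

```python
# Unicode blocks that hold Ethiopic (Ge'ez script) characters.
ETHIOPIC_RANGES = [
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
]


def is_ethiopic(ch: str) -> bool:
    """True if the character falls in one of the Ethiopic blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ETHIOPIC_RANGES)


def is_basic_latin(ch: str) -> bool:
    """True for the ASCII letters a-z and A-Z."""
    return "a" <= ch.lower() <= "z"


def guess_script(text: str, threshold: float = 0.5) -> str:
    """Label text as 'geez', 'latin', or 'other' by its share of letters.

    The threshold is a hypothetical default chosen for illustration.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "other"
    geez_share = sum(is_ethiopic(ch) for ch in letters) / len(letters)
    latin_share = sum(is_basic_latin(ch) for ch in letters) / len(letters)
    if geez_share >= threshold:
        return "geez"
    if latin_share >= threshold:
        return "latin"
    return "other"


if __name__ == "__main__":
    print(guess_script("\u1230\u120b\u121d"))  # three Ethiopic syllables
    print(guess_script("selam"))
```

A share-based rule like this tolerates comments that mix a few Latin tokens into Ge'ez text, which a strict all-or-nothing check would not.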
Read Paper → View Project →
Tigrinya Question Answering Dataset and Models: the first human-annotated question answering benchmark dataset for Tigrinya, with 10.6K QA pairs drawn from 290 news articles. The work received an Outstanding Paper Award at ACL.
Question Answering, Dataset, Tigrinya
Read Paper → View Project →
A benchmark dataset and evaluation for language identification (LI), targeting five typologically and phylogenetically related low-resource East African languages that use the Ge'ez script.
Language ID, Dataset, Ge'ez Script
Read Paper → View Project →
Monolingual pre-trained language models for Tigrinya (TiRoBERTa, TiBERT, TiELECTRA), trained to provide strong baseline representations for downstream tasks.
Language Models, Pre-training, Tigrinya
View Paper → View Project →
Tigrinya Language Modeling Dataset: a large-scale monolingual dataset collected from news articles, blogs, and books, featuring approximately 40 million tokens for training language models.
Corpus, Language Modeling, Tigrinya
Dataset →
Rules and algorithms for converting numbers to written Tigrinya and back. Useful for text-to-speech, accessibility, and localization. Released as an open-source Python package.
Number Verbalization, TTS
Read Paper → View Project →
GLOCR is a large-scale open-source dataset for text recognition and optical character recognition (OCR) of the Tigrinya language, featuring over 660,000 image-label pairs.
OCR, Dataset, Tigrinya
Dataset → View Project →
A Tigrinya adaptation of the classic Google Analogy Test Set, containing over 18K entries to empirically evaluate the semantic and syntactic qualities of word-embedding models.
Evaluation, Word Embeddings, Tigrinya
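Analogy test sets of this kind are commonly scored with the vector-offset method: for a question "a is to b as c is to ?", pick the vocabulary word closest to b - a + c. The sketch below illustrates that scoring loop. The toy embeddings and English placeholder words are invented for illustration, and the function names are assumptions rather than the project's actual evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Share of answerable (a, b, c, d) questions where the prediction equals d."""
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Toy 3-dimensional embeddings (hypothetical values, not real model output).
toy_emb = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.2, 0.0],
}
toy_questions = [
    ("man", "king", "woman", "queen"),
    ("man", "woman", "king", "queen"),
]

if __name__ == "__main__":
    print(analogy_accuracy(toy_emb, toy_questions))
```

Questions containing out-of-vocabulary words are skipped here rather than counted as errors; either convention is used in practice, so it is worth stating which one a reported score follows.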
View Project →
A statistically generated bilingual lexicon between English and Tigrinya, built from parallel corpora without human supervision to aid research in low-resource settings.
Lexicon, Bilingual
View Project →
Comprehensive word-count compilations and stop-word lists curated for both Tigrinya and Tigre to support foundational NLP research.
Lexical Data, Tigrinya, Tigre
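Word-count tables like these can be produced with a simple frequency pass over a corpus, and the most frequent words are a common starting point for a stop-word list before manual curation. The sketch below assumes whitespace tokenization plus stripping of Ethiopic and ASCII punctuation. The function names, the toy corpus, and the frequency-based shortlist are illustrative assumptions, not the method used to build the released lists.

```python
import re
from collections import Counter

# Ethiopic punctuation (U+1360 to U+1368) plus common ASCII punctuation.
_PUNCT = re.compile(r"[\u1360-\u1368.,;:!?\"'()\[\]]")


def tokenize(line: str) -> list:
    """Replace punctuation with spaces, then split on whitespace."""
    return _PUNCT.sub(" ", line).split()


def word_counts(lines) -> Counter:
    """Count token occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def stop_word_candidates(counts: Counter, top_k: int = 50) -> list:
    """Return the top_k most frequent tokens as candidates for manual review."""
    return [word for word, _ in counts.most_common(top_k)]


# Toy corpus with English placeholder tokens (hypothetical data).
toy_lines = ["the cat saw the dog.", "the dog ran"]

if __name__ == "__main__":
    counts = word_counts(toy_lines)
    print(counts.most_common(3))
    print(stop_word_candidates(counts, top_k=2))
```

Raw frequency alone tends to surface domain-specific content words along with function words, which is one reason a curated list usually involves a review step after a pass like this.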
Tigrinya → Tigre →
A review of over 50 NLP studies on Tigrinya published between 2011 and 2025, covering machine translation, morphology, QA, speech, and more. Also maintained as an open bibliography on GitHub.
Review, Anthology, Tigrinya
Read Paper → View Project →