GLaRe - GeezLab Research Program

Research Projects

Work conducted or supported by GLaRe:

Dahlak LMs In Progress

A family of language models trained on Eritrean languages. The goal is strong base models that work well out of the box for Tigrinya, Tigre, and related languages.

Language Models Tigrinya Tigre

TiALD NeurIPS 2025

Tigrinya Abusive Language Detection: a multi-task benchmark with 13.7K annotated YouTube comments covering abusiveness, sentiment, and topic classification in both Ge'ez script and Latin transliterations.

Content Moderation Multi-Task Dataset

Read Paper → View Project →

TiQuAD ACL 2023

Tigrinya Question Answering Dataset and Models: the first human-annotated question answering benchmark dataset for Tigrinya, with 10.6K QA pairs from 290 news articles. Received an Outstanding Paper Award at ACL.

Question Answering Dataset Tigrinya

Read Paper → View Project →

TiNC24 Springer LRE 2025

Tigrinya NER Corpus: A large-scale human-labeled dataset with over 200K words tagged for Named Entity Recognition across 8 entity classes and 10 domains, with models achieving 90.18% F1. Led by students at Mai Nefhi College of Engineering.

NER Dataset Tigrinya

Read Paper → View Project →

TiPOSC24 ICNLP 2026

Tigrinya POS Corpus: An expanded part-of-speech tagging dataset with over 118K tokens annotated across 12 POS classes. TiRoBERTa achieves 95.6% F1 on it, a 5.6-point improvement over prior work. Led by students at Mai Nefhi College of Engineering.

POS Tagging Dataset Tigrinya

Read Paper → View Project →

GeezSwitch LREC 2022

A benchmark dataset and evaluation for Language Identification (LI) targeting five typologically and phylogenetically related low-resource East African languages that use the Ge'ez script.

Language ID Dataset Ge'ez Script

Read Paper → View Project →

Tigrinya PLMs EMNLP/WiNLP 2021

Monolingual pre-trained language models for Tigrinya (TiRoBERTa, TiBERT, TiELECTRA), trained to provide strong baseline representations for downstream tasks.

Language Models Pre-training Tigrinya

View Paper → View Project →

TLMD Dataset

Tigrinya Language Modeling Dataset: A large-scale monolingual dataset collected from news articles, blogs, and books, featuring approximately 40 million tokens for training language models.

Corpus Language Modeling Tigrinya

Dataset →

Verbalizing Numbers GitHub

Rules and algorithms for converting numbers to written Tigrinya and back. Useful for text-to-speech, accessibility, and localization. Released as an open-source Python package.

Number Verbalization TTS

Read Paper → View Project →

GeezLab OCR Dataset

GLOCR is a large-scale open-source dataset for Text Recognition and Optical Character Recognition (OCR) of the Tigrinya language, featuring over 660,000 image-label pairs.

OCR Dataset Tigrinya

Dataset → View Project →

Analogy Test Dataset

A Tigrinya adaptation of the classic Google Analogy Test set, containing over 18K entries to empirically evaluate the semantic and syntactic qualities of word-embedding models.

Evaluation Word Embeddings Tigrinya

View Project →
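As an illustration of how an analogy entry ("a is to b as c is to ?") is typically scored against a word-embedding model, here is a minimal sketch of the common vector-offset method. The toy 2-D vectors, the English placeholder words, and the function names are assumptions for illustration only; they are not taken from the dataset or its evaluation code.

```python
import numpy as np

# Toy 2-D vectors with English placeholder words (hypothetical; a real run
# would load Tigrinya embeddings and the dataset's own entries).
EMB = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' via the nearest neighbour of b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    # The three query words are excluded from the candidates.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, entries):
    """Fraction of (a, b, c, expected) entries answered correctly."""
    hits = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in entries)
    return hits / len(entries)


ENTRIES = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]
print(analogy_accuracy(EMB, ENTRIES))
```

Accuracy is usually reported per category, so semantic and syntactic entries can be compared separately.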

Tigrinya-BiLexicon GitHub

A statistically generated bilingual lexicon between English and Tigrinya, built using parallel corpora without human supervision to aid research in low-resource environments.

Lexicon Bilingual

View Project →

Word Frequencies GitHub

Comprehensive word count compilations and stop-word lists curated for both the Tigrinya and Tigre languages to support foundational NLP research.

Lexical Data Tigrinya Tigre

Tigrinya → Tigre →
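For readers who want to build comparable counts from their own text, the sketch below shows one simple way to tally word frequencies in Ge'ez-script text and pick the most frequent words as stop-word candidates. The tokenization rule (splitting on whitespace and Ethiopic punctuation), the sample string, and the function names are assumptions for illustration; the published lists may have been compiled and curated differently.

```python
import re
from collections import Counter

# Split on whitespace plus the Ethiopic punctuation range U+1361-U+1368
# (wordspace, full stop, comma, and so on). This rule is an assumption.
SPLIT_RE = re.compile(r"[\s\u1361-\u1368]+")


def word_counts(text):
    """Count tokens after splitting on whitespace and Ethiopic punctuation."""
    tokens = [t for t in SPLIT_RE.split(text) if t]
    return Counter(tokens)


def stopword_candidates(counts, k):
    """Return the k most frequent words as stop-word candidates."""
    return [word for word, _ in counts.most_common(k)]


# Tiny toy sample, not drawn from the released data.
sample = "ሰላም ዓለም። ሰላም ኩሉ።"
counts = word_counts(sample)
print(counts.most_common())
print(stopword_candidates(counts, 1))
```

Frequency alone only suggests candidates; a usable stop-word list still needs manual review.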

Tigrinya Anthology GitHub

A review of over 50 NLP studies on Tigrinya published between 2011 and 2025, covering machine translation, morphology, QA, speech, and more. Also maintained as an open bibliography on GitHub.

Review Anthology Tigrinya

Read Paper → View Project →

Research in AI & NLP for Native Languages
