GeezLab Research (GLaRe)

Research in AI & NLP for Native Languages

GLaRe is where we do the research side of GeezLab. We work on AI and Natural Language Processing for Eritrean languages, and support students and researchers who are underrepresented in the field.
(glare: to shine with a strong or dazzling light)

What We Do

We care about languages that don't yet have the tools they deserve.

🔬

Basic Research

We study AI and NLP problems specific to Tigrinya, Tigre, and other Eritrean languages. Most of these languages are severely underserved by existing tools, and we want to change that.

🎓

Supporting Researchers

We fund and mentor students and researchers who are working on language technology but lack institutional support. Good ideas shouldn't go unfunded just because of where you're from.

🌍

Open Resources

Everything we build, we share. Datasets, models, code, and tools are published openly so others can use them, build on them, and push the work further.

Research Projects

Work funded and conducted through GLaRe

Dahlak LMs (In Progress)

A family of language models trained on Eritrean languages. The goal is to have good base models that work well for Tigrinya, Tigre, and related languages out of the box.

Tags: Language Models, Tigrinya, Tigre

TiALD (NeurIPS 2025)

Tigrinya Abusive Language Detection: a multi-task benchmark with 13.7K annotated YouTube comments covering abusiveness, sentiment, and topic classification in both Ge'ez script and Latin transliterations.

Tags: Content Moderation, Multi-Task, Dataset

TiQuAD (ACL 2023)

Tigrinya Question Answering Dataset and Models: the first human-annotated question answering benchmark for Tigrinya, with 10.6K QA pairs drawn from 290 news articles. The paper received an Outstanding Paper Award at ACL 2023.

Tags: Question Answering, Dataset, Tigrinya

GeezSwitch (LREC 2022)

A benchmark dataset and evaluation for Language Identification (LI), targeting five typologically and phylogenetically related low-resource East African languages that use the Ge'ez script.

Tags: Language ID, Dataset, Ge'ez Script

Tigrinya PLMs (WiNLP 2021)

Monolingual pre-trained language models for Tigrinya (TiRoBERTa, TiBERT, TiELECTRA), trained to provide strong baseline representations for downstream tasks.

Tags: Language Models, Pre-training, Tigrinya

TLMD Dataset

Tigrinya Language Modeling Dataset: A large-scale monolingual dataset collected from news articles, blogs, and books, featuring approximately 40 million tokens for training language models.

Tags: Corpus, Language Modeling, Tigrinya

Verbalizing Numbers (GitHub)

Rules and algorithms for converting numbers to written Tigrinya and back. Useful for text-to-speech, accessibility, and localization. Released as an open-source Python package.

Tags: Number Verbalization, TTS

GeezLab OCR Dataset

GLOCR is a large-scale open-source dataset for Text Recognition and Optical Character Recognition (OCR) of the Tigrinya language, featuring over 660,000 image-label pairs.

Tags: OCR, Dataset, Tigrinya

Analogy Test Dataset

A Tigrinya adaptation of the classic Google Analogy Test set, containing over 18K entries to empirically evaluate the semantic and syntactic qualities of word-embedding models.

Tags: Evaluation, Word Embeddings, Tigrinya

Tigrinya-BiLexicon (GitHub)

A statistically generated bilingual lexicon between English and Tigrinya, built using parallel corpora without human supervision to aid research in low-resource environments.

Tags: Lexicon, Bilingual

Word Frequencies (GitHub)

Word-frequency counts and curated stop-word lists for both Tigrinya and Tigre, intended to support foundational NLP research.

Tags: Lexical Data, Tigrinya, Tigre

Tigrinya Anthology (GitHub)

A review of over 50 NLP studies on Tigrinya published between 2011 and 2025, covering machine translation, morphology, QA, speech, and more. Also maintained as an open bibliography on GitHub.

Tags: Review, Anthology, Tigrinya

Research Proposals

If you are working on NLP or AI for underrepresented languages, we may be able to help. GLaRe provides research resources, mentorship, and funding to selected research projects.

Send a brief proposal to research@geezlab.com with your abstract and what kind of support you need.