Skip to content

ciralproject/ciral

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 

Repository files navigation

🌍 CIRAL

CIRAL (Cross-Lingual Information Retrieval for African Languages) is a test collection focused on promoting the research and evaluation of Cross-Lingual Information Retrieval (CLIR) for African languages. Our collection covers cross-lingual retrieval between English and 4 African languages, with queries in English and passages in the African languages. This repo provides details of the test collection, guidelines for system evaluations and baselines.

Hosted as a track at the Forum for Information Retrieval Evaluation (FIRE) 2023, the goal of our track was to promote participation and community evaluations in CLIR for African languages. More information regarding the track can be found at the website: Ciral@Fire2023

📚 Corpora

The current languages in CIRAL are Hausa, Swahili, Somali and Yoruba. The corpora consists of passages from news articles, mined from indigenous websites of the different languages.

Link to Dataset: https://huggingface.co/datasets/CIRAL/ciral-corpus/

Statistics and details of the collection are found below.

Language News Sources # of Passages # of Articles Link
Hausa (hau) LegitNG, DailyTrust, VOA, Isyaku, etc. 715,355 240,883 🤗
Somali (som) VOA, UN Swahili, MTanzania, etc. 827,552 629,441 🤗
Swahili (swa) VOA, Tuko, Risaala, Caasimada, etc. 949,013 146,669 🤗
Yoruba (yor) Alaroye, VON, BBC, Asejere, etc. 82,095 27,985 🤗

For each language, passages are stored in JSONL files where each line corresponds to a passage in JSON format. The fields provided for each passage include:

  • docid: Unique identifier of the passage
  • title: Title of the news article from which the passage was extracted
  • text: Text of the passage
  • url: News article url

📚 Queries and Relevance Judgements

CIRAL's queries and relevance judgements are provided for the four languages in three sets: development set, test set A and test set B. Additionally, test set A consists of pools curated from CIRAL's shared task. The queries and relevance judgement files can be accessed in the Hugging Face repo.

Statistics for the queries and relevance judgements are given below. The development set consists of few samples to analyze relevance and evaluate proposed systems using the provided judgements.

Dev Test A Test B
Language #Q #J #Q #J Pool Size #Q #J Link
Hausa (ha) 10 165 80 1,447 7,288 312 5,930 🤗
Somali (so) 10 187 99 1,798 9,094 239 4,324 🤗
Swahili (sw) 10 196 85 1,656 8,079 113 2,175 🤗
Yoruba (yo) 10 185 100 1,921 8,311 554 10, 569 🤗

Both query and relevance judgements files are in the .tsv format. Each line in the query file is represented as:

qid\tquery

while the judgements are in the standard TREC format:

qid Q0 docid relevance

📚 Guidelines and Resources

Task Description

This task entails queries formulated as natural language questions in English, and retrieval done at the passage-level for the different African languages. Information retrieval systems developed for the task will receive the collection of passages and a set of queries for the different African languages. For each query in the test set, proposed systems are to return a ranked list of passages ordered by likelihood of binary relevance to the query. Up to 1000 passages per query can be submitted, results with more than 1000 would be truncated.

Details regarding participation can be found in this section of the website.

Getting started with IR and CIRAL

For more details on getting started with IR and understanding the task, please check the provided Quick Start

🔎 Baselines and Evaluation

Baselines and reproduction guides are provided in this section. Please note that this only covers searching, as the indexes have already been built.

The baselines can be reproduced using Pyserini. To reproduce the baselines:

  1. Install the development version of Pyserini by following this guide.
  2. Follow the commands in the 2-click-reproduction (2CR)

Our reranking baseline models are also available on Hugging Face: mT5, AfrimT5.

Citation

@inproceedings{10.1145/3626772.3657884,
author = {Adeyemi, Mofetoluwa and Oladipo, Akintunde and Zhang, Xinyu and Alfonso-Hermelo, David and Rezagholizadeh, Mehdi and Chen, Boxing and Omotayo, Abdul-Hakeem and Abdulmumin, Idris and Etori, Naome A. and Musa, Toyib Babatunde and Fanijo, Samuel and Awoyomi, Oluwabusayo Olufunke and Salahudeen, Saheed Abdullahi and Mohammed, Labaran Adamu and Abolade, Daud Olamide and Lawan, Falalu Ibrahim and Sabo Abubakar, Maryam and Nasir Iro, Ruqayya and Imam Abubakar, Amina and Mohamed, Shafie Abdi and Mohamed, Hanad Mohamud and Ajayi, Tunde Oluwaseyi and Lin, Jimmy},
title = {CIRAL: A Test Collection for CLIR Evaluations in African Languages},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657884},
doi = {10.1145/3626772.3657884},
pages = {293–302},
numpages = {10},
keywords = {african languages, cross-lingual information retrieval},
location = {Washington DC, USA},
series = {SIGIR '24}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published