Rationalized English-French Semantic Divergences: annotation workflow

Bits and pieces missing from (my) paper descriptions.

Eleftheria Briakou

October 27, 2020

In this blog post, I go through (all) annotation decisions involved in collecting the Rationalized English-French Semantic Divergences corpus, dubbed REFreSD. My main goal is not to describe REFreSD per se but rather to use it as an example to provide a holistic view of the annotation workflow that is often missing from paper descriptions. Therefore, I will not cover details that are specific to the dataset at hand. If you are interested in that, you can check out the Appendix of our paper.

This blog provides:

✔️ a complete depiction of the annotation workflow;
✔️ a full description of Data Statements;
✔️ a DataSheet for REFreSD;
✔️ a list of documents reviewed by the Institutional Review Board at UMD.

If you don’t know what Data Statements and Datasheets for Datasets are, follow the links and check out the papers!

Introduction

REFreSD was published at EMNLP 2020 by Eleftheria and Marine.

Motivation: The project under which REFreSD was collected aims to advance our fundamental understanding of computational representations and methods for comparing and contrasting text meaning across languages. Currently, much cross-lingual work in Natural Language Processing relies on the assumption that sentences drawn from parallel corpora are equivalent in meaning. Yet, content conveyed in two distinct languages is rarely exactly equivalent. The ability of computational methods to detect such meaning mismatches can be assessed by comparing their predictions with the human judgments in REFreSD.

Annotation task: Human annotators were asked to read text excerpts in two languages (e.g., one in English and another in French). We collected their assessments of the meaning differences they observed via sentence-level divergence judgments and token-level rationales.
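To make the two kinds of judgments concrete, here is a minimal sketch of how one might store and aggregate them. Everything in it is an assumption made for illustration: the `Annotation` schema, the helper names, and the placeholder label strings are hypothetical and are not the actual REFreSD format or aggregation procedure.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Annotation:
    """One annotator's judgment for one sentence pair (hypothetical schema)."""
    label: str  # sentence-level divergence judgment (placeholder strings)
    en_rationale: Set[int] = field(default_factory=set)  # highlighted English token indices
    fr_rationale: Set[int] = field(default_factory=set)  # highlighted French token indices


def majority_label(annotations: List[Annotation]) -> Optional[str]:
    """Return the most frequent sentence-level label, or None if empty or tied."""
    counts = Counter(a.label for a in annotations).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def rationale_votes(annotations: List[Annotation], side: str) -> Dict[int, int]:
    """Count how many annotators highlighted each token index on one side ('en' or 'fr')."""
    votes: Counter = Counter()
    for a in annotations:
        votes.update(a.en_rationale if side == "en" else a.fr_rationale)
    return dict(votes)


# Three hypothetical annotators judging the same sentence pair.
anns = [
    Annotation("divergent", {1, 2}, {3}),
    Annotation("divergent", {2}),
    Annotation("equivalent"),
]
```

With this toy input, `majority_label(anns)` picks the label chosen by two of the three annotators, and `rationale_votes(anns, "en")` shows which English tokens were highlighted by more than one annotator. Returning `None` on ties is just one possible design choice; it leaves the decision about disputed pairs to a later step.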

Annotation workflow

This chart presents the annotation workflow used for collecting REFreSD. People involved in each step of the process are listed on the right, while documents those people are presented with are shown on the left.

Documentation

We publish all documents shown on the left side of the annotation workflow chart:

1️⃣ GUIDELINES includes the exact wording presented to participants;
2️⃣ PROCEDURES covers text reviewed by the Institutional Review Board at UMD;
3️⃣ ADVERTISING presents the email used to recruit participants;
4️⃣ DATA STATEMENTS for REFreSD;
5️⃣ DATASHEET for REFreSD.

Baselines

The Divergent mBERT paper presents several baselines for predicting semantic divergences at the sentence and token levels. Code is available here.

Acknowledgments

This dataset was collected after discussions and feedback among the following folks: Sweta Agrawal, Dennis Asamoah Owusu, Valerio Basile, Emily Bender, Tommaso Caselli, Pranav Goel, Ching-Lin Huang, Nina Kamooei, Marianna Martindale, Alexander Miserlis Hoyle, Aquia Richburgh, and Weijia Xu. Thanks all!
