In this blog post, I go through (all) annotation decisions involved in the collection of the Rationalized English-French Semantic Divergences corpus, dubbed REFreSD. My main goal is not to describe REFreSD per se but rather to use it as an example to provide a holistic view of the annotation workflow that is often missing from paper descriptions. Therefore, I will not cover details that are too specific to the dataset at hand. If you are interested in that, you can check out the Appendix of our paper.
This blog provides:
✔️ a complete depiction of the annotation workflow;
✔️ a full description of Data Statements;
✔️ a DataSheet for REFreSD;
✔️ a list of documents reviewed by the Institutional Review Board at UMD.
Motivation: The project under which REFreSD was collected aims to advance our fundamental understanding of computational representations and methods for comparing and contrasting text meaning across languages. Currently, much cross-lingual work in Natural Language Processing relies on the assumption that sentences drawn from parallel corpora are equivalent in meaning. Yet, content conveyed in two distinct languages is rarely exactly equivalent. The ability of computational methods to detect such meaning mismatches can be assessed by comparing their predictions with the human judgments in REFreSD.
Annotation task: Human annotators were asked to read text excerpts in two languages (one in English and one in French). We collect their assessment of the meaning differences they observe via sentence-level divergence judgments and token-level rationales.
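To make the two annotation layers concrete, here is a hypothetical sketch of what a single annotated pair could look like. The field names, label values, and the example sentences are illustrative assumptions of mine, not the released REFreSD file format.

```python
# Hypothetical sketch of one REFreSD-style annotation record.
# Field names and the label set are illustrative assumptions,
# not the released file format.
record = {
    "en_tokens": ["The", "play", "was", "a", "huge", "success", "."],
    "fr_tokens": ["La", "piece", "a", "eu", "du", "succes", "."],
    # Sentence-level divergence judgment chosen by the annotator:
    "sentence_label": "some_meaning_difference",
    # Token-level rationales: 1 marks a token the annotator highlighted
    # as contributing to the meaning difference, 0 otherwise.
    "en_rationale": [0, 0, 0, 0, 1, 0, 0],  # "huge" lacks a French counterpart
    "fr_rationale": [0, 0, 0, 0, 0, 0, 0],
}

# A simple consistency check: rationales align one-to-one with tokens.
assert len(record["en_rationale"]) == len(record["en_tokens"])
assert len(record["fr_rationale"]) == len(record["fr_tokens"])
```

The key point is that the two layers are linked: the token-level rationales justify the sentence-level judgment, which is what makes the divergences "rationalized."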
This chart presents the annotation workflow used for collecting REFreSD. People involved in each step of the process are listed on the right, while documents those people are presented with are shown on the left.
We publish all documents related to the (left side) annotation workflow:
1️⃣ GUIDELINES includes the exact wording presented to participants;
2️⃣ PROCEDURES covers text reviewed by the Institutional Review Board at UMD;
3️⃣ ADVERTISING presents the email used to recruit participants;
4️⃣ DATA STATEMENTS for REFreSD;
5️⃣ DATASHEET for REFreSD.
The Divergent mBERT paper presents several baselines for the prediction of semantic divergences at the sentence and token levels. Code is available here.
This dataset was collected after discussions and feedback among the following folks: Sweta Agrawal, Dennis Asamoah Owusu, Valerio Basile, Emily Bender, Tommaso Caselli, Pranav Goel, Ching-Lin Huang, Nina Kamooei, Marianna Martindale, Alexander Miserlis Hoyle, Aquia Richburgh, and Weijia Xu. Thanks all!