Evaluation

Multilingual Evaluation

As a community, we have overfitted to the characteristics of English-language data when modeling various tasks. Does the same hold for our evaluation metrics? How can we evaluate natural language generation when moving to multilingual settings?