As a community, we have overfitted to the characteristics of English-language data when modeling various tasks. Does the same hold for our evaluation metrics? How can we evaluate natural language generation as we move to multilingual settings?
      
A dominant hypothesis in multilingual research is that models developed and optimized for English can be transferred seamlessly to a new language, and perform well there, simply by accessing data in that language. Yet the same task (e.g., formality transfer) might involve modeling different linguistic characteristics across languages. How well can we perform style transfer beyond English?