Adapting Machine Translation Engines to the Needs of Cultural Heritage Metadata

Konstantinos Chatzitheodorou; Eirini Kaldeli; Antoine Isaac; Paolo Scalia; Carmen Grau Lacal; MªÁngeles García Escrivá

doi:10.5860/ital.v43i3.17247

Authors

Konstantinos Chatzitheodorou, Researcher
Eirini Kaldeli
Antoine Isaac
Paolo Scalia
Carmen Grau Lacal
MªÁngeles García Escrivá

Keywords:

Europeana, EuropeanaTranslate Project, metadata, cultural heritage, machine translation

Abstract

The Europeana digital library features cultural heritage collections from over 3,000 European institutions, described in 37 languages. However, most textual metadata describe each record in a single language, that of the data provider. Improving Europeana’s multilingual accessibility is challenging because of the distinctive characteristics of cultural heritage metadata, which are often expressed in short phrases and use in-domain terminology. This work presents the approach and results of the EuropeanaTranslate project, which aimed to translate Europeana metadata records from 23 EU languages into English. Machine translation (MT) engines were trained on a cleaned selection of bilingual and synthetic data from Europeana, including multilingual vocabularies and relevant cultural heritage repositories. Automatic translations were evaluated through standard metrics and human assessments by linguists and cultural heritage domain experts. The results showed significant improvements over the generic engines used before the in-domain training, as well as over the eTranslation service, for most languages. The EuropeanaTranslate engines have translated over 29 million metadata records on Europeana.eu. Additionally, the MT engines and training datasets are publicly available via the European Language Grid Catalogue and the ELRC-SHARE repository.
