Unlocking the Digitized Historical Newspaper Archive
Exploring Historical Insights with Deep Learning
DOI:
https://doi.org/10.5860/ital.v44i3.17292
Keywords:
Computer Vision, Image Analysis, Machine Learning, Deep Learning, Geographic Information System, GIS
Abstract
This paper applies computer vision and machine/deep learning to historical newspapers, extracting headlines and illustrations for storytelling. The work seeks to unlock the historical knowledge embedded in newspaper content while applying current methodological paradigms for research in the digital humanities (DH). We aimed to offer an alternative to traditional search and browse interfaces by combining these DH tools with place- and time-based visualizations. Experimental results showed that our proposed methods, OCR (optical character recognition) with scraping and deep learning object detection models, can extract the textual and image content needed for more sophisticated analysis. Timeline and geodata visualization products were developed to support comprehensive exploration of our historical newspaper data. The timeline-based tool spans the period from July 1942 to July 1945, enabling users to follow evolving narratives through the lens of daily headlines. The interactive geographical tool enables users to identify geographic hotspots and patterns. Used together, the two products can enrich users’ understanding of the events and narratives unfolding across time and space.
License
Copyright (c) 2025 Vincent Wai-Yip Lum, Michael Kin-Fu Yip

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors who submit to Information Technology and Libraries agree to the Copyright Notice.