Web Archives Metadata Generation with GPT-4o

Challenges and Insights

DOI:

https://doi.org/10.5860/ital.v44i2.17305

Keywords:

Metadata, Large Language Models, Digital Archives, Web Preservation

Abstract

Current metadata creation for web archives is time-consuming and costly due to reliance on human effort. This paper explores the use of GPT-4o for metadata generation within Web Archive Singapore, focusing on scalability, efficiency, and cost-effectiveness. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a 99.9% reduction in metadata generation costs. Through prompt engineering, we generated titles and abstracts, which were evaluated intrinsically using Levenshtein distance and BERTScore and extrinsically with human cataloguers using McNemar's test. Results indicate that while our method offers significant cost savings and efficiency gains, human-curated metadata maintains an edge in quality. The study identifies key challenges, including content inaccuracies, hallucinations, and translation issues, suggesting that large language models (LLMs) should serve as complements rather than replacements for human cataloguers. Future work will focus on refining prompts, improving content filtering, and addressing privacy concerns through experimentation with smaller models. This research advances the integration of LLMs in web archiving, offering insights into their current capabilities and outlining directions for future enhancements. The code is available at https://github.com/masamune-prog/warc2summary for further development and use by institutions facing similar challenges.
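The two evaluation measures named in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' warc2summary implementation; the example strings and discordant-pair counts are invented for demonstration.

```python
# Illustrative sketch: intrinsic evaluation with Levenshtein distance and
# extrinsic evaluation with McNemar's test (chi-squared form).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mcnemar_statistic(b: int, c: int) -> float:
    """McNemar's chi-squared statistic with continuity correction.

    b and c are the counts of discordant pairs, e.g. items judged
    acceptable only in the human-curated metadata (b) versus only in
    the LLM-generated metadata (c).
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical generated vs. reference title for a harvested site.
generated = "Singapore heritage site records"
reference = "Singapore heritage sites record"
dist = levenshtein(generated, reference)
similarity = 1 - dist / max(len(generated), len(reference))
```

A smaller edit distance (higher normalized similarity) indicates generated metadata closer to the human-curated reference; the McNemar statistic is then compared against the chi-squared distribution with one degree of freedom to test whether cataloguers' acceptance rates for the two metadata sources differ significantly.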

References

A. K. Wong and D. K. W. Chiu, “Digital Curation Practices on Web and Social Media Archiving in Libraries and Archives,” Journal of Librarianship and Information Science (2024), https://doi.org/10.1177/09610006241252661.

Alec Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (PMLR, 2021), 8748–63.

Anthropic, Claude 3.5 Sonnet, June 21, 2024, https://www.anthropic.com/news/claude-3-5-sonnet.

Ashish Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems 30, ed. I. Guyon et al. (2017), https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Chaoyi Wu et al., “PMC-LLaMA: Towards Building Open-Source Language Models for Medicine,” 2023, accessed August 10, 2024, https://arxiv.org/abs/2304.14454.

Charles Babbage, Passages from the Life of a Philosopher (Longman, Green, Longman, Roberts, & Green, 1864).

Common Crawl, https://commoncrawl.org/.

DCMI-Libraries Working Group, DC-Libraries - Library Application Profile - Draft (technical report), Dublin Core Metadata Initiative, September 2004, accessed August 10, 2024, https://dublincore.org/specifications/dublin-core/library-application-profile/.

Denny Zhou et al., “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models,” 2023, accessed August 10, 2024, https://arxiv.org/abs/2205.10625.

E. H. C. Chow, T. G. Kao, and X. Li, “An Experiment with the Use of ChatGPT for LCSH Subject Assignment on Electronic Theses and Dissertations,” accessed August 23, 2024, http://arxiv.org/abs/2403.16424.

Emily Maemura, “All WARC and No Playback: The Materialities of Data-Centered Web Archives Research,” Big Data & Society 10, no. 1 (2023), https://journals.sagepub.com/doi/10.1177/20539517231163172.

“Frequently Asked Questions,” National Library Board, Web Archive Singapore, accessed August 22, 2024, https://eresources.nlb.gov.sg/webarchives/faq.

IFLA, “Session 157: Utopia, Threat or Opportunity First? Artificial Intelligence and Machine Learning for Cataloguing,” 2023, Full Programme, 88th IFLA General Conference and Assembly, accessed August 22, 2024, https://iflawlic2023.abstractserver.com/program/#/details/sessions/276.

IIPC, “WAC 2024 Program,” 2024, accessed August 22, 2024, https://netpreserve.org/ga2024/programme/wac/.

Ilya Kreymer, warcio, 2020, GitHub repository, https://github.com/webrecorder/warcio.

Jackie Dooley and Kate Bowers, “Descriptive Metadata for Web Archiving: Recommendations of the OCLC Research Library Partnership Web Archiving Metadata Working Group,” OCLC Research (February 17, 2018), https://www.oclc.org/research/publications/2018/oclcresearch-descriptive-metadata/recommendations.html.

Janek Bevendorff et al., “FastWARC: Optimizing Large-Scale Web Archive Analytics,” 2021, https://arxiv.org/abs/2112.03103.

Jason Liu, Instructor: Structured LLM Outputs, 2024, GitHub repository, accessed August 13, 2024, https://github.com/jxnl/instructor.

Jason Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” in Advances in Neural Information Processing Systems, vol. 35, ed. S. Koyejo et al. (Curran Associates, Inc., 2022), 24824–37, https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Laurie Allen, “Why Experiment: Machine Learning at the Library of Congress,” The Signal: Digital Happenings at the Library of Congress, Library of Congress Blogs, November 13, 2023, https://blogs.loc.gov/thesignal/2023/11/why-experiment-machine-learning-at-the-library-of-congress/.

Long Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback,” in Advances in Neural Information Processing Systems 36, ed. S. Koyejo et al. (Curran Associates, Inc., 2022), 27730–44, https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

M. Cargnelutti, K. Mukk, and C. Stanton, “WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI,” Harvard Law School Library Innovation Lab blog, 2024, accessed August 23, 2024, https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/.

Maciej Besta et al., “Graph of Thoughts: Solving Elaborate Problems with Large Language Models,” Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 2024): 17682–90, https://doi.org/10.1609/aaai.v38i16.29720.

Muhammad Zakir et al., “Navigating the Legal Labyrinth: Establishing Copyright Frameworks for AI-Generated Content,” Remittances Review 9 (January 2024): 2515–32, https://remittancesreview.com/article-detail/?id=1467.

National Library Board Act 1995, Singapore Statutes Online, accessed August 11, 2024, https://sso.agc.gov.sg/Act/NLBA1995.

National Library Board, accessed August 10, 2024, https://www.nlb.gov.sg/main/home.

Nicholas Carlini et al., “Extracting Training Data from Large Language Models,” in 30th USENIX Security Symposium (USENIX Security 21), USENIX Association, August 2021, 2633–50, https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.

OpenAI, Hello GPT-4o, May 13, 2024, https://openai.com/index/hello-GPT-4o/.

Quinn McNemar, “Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages,” Psychometrika 12, no. 2 (1947): 153–57, https://doi.org/10.1007/BF02295996.

R. Brzustowicz, “From ChatGPT to CatGPT: The Implications of Artificial Intelligence on Library Cataloguing,” Information Technology and Libraries 42, no. 3 (2023), https://doi.org/10.5860/ital.v42i3.16295.

Rafael Rafailov et al., “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model,” in Advances in Neural Information Processing Systems, vol. 36, ed. A. Oh et al. (Curran Associates, Inc., 2023), 53728–41, https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.

Shereen Tay, “An Archive of Singapore Websites: Preserving the Digital,” BiblioAsia 16, no. 3 (October–December 2020), https://biblioasia.nlb.gov.sg/vol-16/issue-3/oct-dec-2020/website/.

Shunyu Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” in Advances in Neural Information Processing Systems, vol. 36, ed. A. Oh et al. (Curran Associates, Inc., 2023), 11809–22, https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf.

Sundar Pichai and Demis Hassabis, “Introducing Gemini: Our Largest and Most Capable AI Model,” Google Blog, December 6, 2023, https://blog.google/technology/ai/google-gemini-ai/#sundar-note.

Tianyi Zhang et al., “BERTScore: Evaluating Text Generation with BERT,” 2020, https://arxiv.org/abs/1904.09675.

OpenAI, tiktoken, 2023, GitHub repository, accessed August 10, 2024, https://github.com/openai/tiktoken.

Tom Brown et al., “Language Models Are Few-Shot Learners,” in Advances in Neural Information Processing Systems 33, ed. H. Larochelle et al. (Neural Information Processing Systems Foundation, Inc., 2020), 1877–1901, https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Vladimir I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,” Soviet Physics Doklady 10 (1966): 707–10.

W. G. Cochran, “The Comparison of Percentages in Matched Samples,” Biometrika 37, no. 3/4 (1950): 256–66, https://doi.org/10.2307/2332378.

“WARC (Web ARChive) File Format,” Library of Congress, https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.

William D. Mellin, “Work with New Electronic ‘Brains’ Opens Field for Army Math Experts,” Hammond Times 10, no. 66 (1957).

Xiao-Yang Liu et al., “FinGPT: Democratizing Internet-Scale Data for Financial Large Language Models,” 2023, accessed August 10, 2024, https://arxiv.org/abs/2307.10485.

Published

2025-06-16

How to Cite

Nair, A., Goh, Z. R., Liu, T., & Huang, A. Y. (2025). Web Archives Metadata Generation with GPT-4o: Challenges and Insights. Information Technology and Libraries, 44(2). https://doi.org/10.5860/ital.v44i2.17305

Section

Articles