Utilizing AI/ML to Enhance Information Extraction, Organization, and Retrieval from Large-scale Archival Collections
You are invited to participate in the 1st Workshop on Utilizing AI/ML to Enhance Information Extraction, Organization, and Retrieval from Large-scale Archival Collections, to be held as part of the ACM/IEEE Joint Conference on Digital Libraries 2024, Hong Kong, Dec 20th, 2024 (JCDL2024).
This workshop addresses the challenges faced by archivists, historians, and researchers in managing and utilizing large-scale digital archives. While the digitization of historical records has expanded access to vast collections, the complexity and volume of the data have created significant obstacles to keeping these resources accessible and usable. This workshop focuses on the application of artificial intelligence (AI) and machine learning (ML) to revolutionize the processing, analysis, and retrieval of large-scale archival collections. By integrating computational methods with archival practices, the workshop aims to explore novel frameworks, tools, and best practices that can advance the field of Computational Archival Science (CAS). The workshop will feature a combination of keynote presentations, oral sessions, and poster presentations to facilitate the exchange of ideas and collaboration among experts from various disciplines, including information science, computer science, and digital humanities. Participants in this workshop will gain insights into the latest advancements in AI/ML technologies for digital archives and contribute to shaping the future of archival science.
The digitization of historical records (also called “digital archives”) has opened a variety of large-scale digital collections to the world. The scale and complexity of digital archives pose enormous challenges for both researchers and memory institutions. One challenge generally acknowledged by the archive community is discovering, using, and analyzing digital archives for public service. Archivists and historians alone cannot be expected to find a magical solution that will instantly make digital records more accessible and useful [2, 5, 7, 8]. Recently, computational archival science (CAS), which applies computational methods and tools such as artificial intelligence/machine learning (AI/ML) to the archival field to address the processing, analysis, storage, and access of large-scale digital records and archives [6], has been proposed as a novel and effective approach to these challenges. To promote the application of AI/ML in digital archives, funding agencies in the United States such as the Institute of Museum and Library Services (IMLS), the National Historical Publications & Records Commission (NHPRC), and the National Endowment for the Humanities (NEH) have been funding more and more grants in this area. Although scholars from different communities such as Information Science and Computer Science have explored the application of natural language processing (NLP), semantic analysis (SA), and computer vision (CV) to archival collections such as oral histories, cultural heritage, and historical newspapers [1, 3, 10, 11], archive professionals and researchers still face obstacles in putting computational methods, especially advanced techniques, into practice: (1) a lack of high-quality, large-scale, open-source corpora for developing effective ML/DL models; (2) a lack of AI/ML tools developed for processing, annotating, analyzing, and visualizing large-scale, multivariate, heterogeneous archival data; and (3) a lack of hands-on resources that teach archive professionals and researchers how to use AI/ML tools to deal with large-scale archival collections in different scenarios.
With the development of generative AI such as ChatGPT [4, 9, 12], it is even more beneficial and urgent to utilize AI/ML to enhance information extraction, organization, retrieval, and other applications by developing more effective and reusable models and exploring best practices. Therefore, we propose this workshop to gather researchers and practitioners and to initiate a collaborative platform for exchanging ideas, sharing pilot studies, and scoping future directions in this cutting-edge area.
Topics of interest include, but are not limited to, the following:
Regular papers: All submissions must be written in English and follow the ACM Proceedings template (up to 10 pages for full papers and 4 pages for short papers, excluding references, which have no page limit). Submissions should be uploaded as PDF files to EasyChair.
Poster & demonstration: We welcome original ideas, early discoveries, work in progress, and industrial applications of innovations in measurement science communication, to be presented in special poster sessions and possibly in 2-minute presentations in the main session. Some research track papers may be invited to the poster track instead, although there will be no difference in the final proceedings between poster and research track submissions. Poster and demo papers should follow the same format as research track papers but are shorter (2 pages).
All submissions will be reviewed by at least two independent reviewers. At least one author per paper must register for the workshop and attend it to present the work. In case of a no-show, the paper (even if accepted) will be removed from the proceedings and from the program.
Outcomes of this workshop will include:
All dates are Anywhere on Earth (AoE).
Deadline for submission: October 31, 2024
Deadline for submission: November 15, 2024
Notification of acceptance: November 24, 2024
Camera ready: December 5, 2024
Workshop date: December 20, 2024, 2:30 pm – 6:00 pm
Haihua Chen is an Assistant Professor in Data Science and the Director of the Intelligent Data Engineering and Analytics Lab in the Department of Information Science at UNT. Dr. Chen’s research focuses on building high-performance and reliable artificial intelligence systems by applying natural language processing and machine learning in important domains such as healthcare, law, and digital libraries, with the mission of solving social problems in health, humanitarian aid, social justice, and sustainability. He has more than 10 years’ experience in AI/ML and has co-authored over 50 articles on applied AI/ML in peer-reviewed journals and conferences in the last six years.
Jeonghyun Kim is a professor in the Department of Information Science at UNT. She is serving as Director of three IMLS-funded grants. One, “Connecting Communities with Libraries, Archives, and Historians through Oral Histories,” focuses on building a national forum on best practices and strategies to respond to challenges around building, implementing, preserving, and accessing community oral history projects. The other two are related to data science and data literacy for library and information professions in the 21st century. Her research areas include digital libraries and archives, data management and curation, and LIS workforce development. She is the editor-in-chief of The Electronic Library.
Xiaoguang Wang is a professor in the School of Information Management and the Director of the Center for Digital Humanities at Wuhan University. He was a postdoctoral research fellow at Ritsumeikan University and a visiting scholar at the University of Illinois at Urbana-Champaign. His research interests include semantic annotation of digital objects and scientific discourse, knowledge organization, and digital humanities. He has published over 100 academic papers in peer-reviewed journals and international conferences, such as Journal of Documentation, Scientometrics, and JASIS&T. At present, he is a council member of the Chinese Information Society of Social Sciences (CISCC) and co-director of the CISCC Digital Humanities Affiliate.
Le Yang holds the position of Associate Vice Provost & University Librarian for Collections, Discovery, and Digital Strategy at the University of Oregon. His diverse portfolio encompasses responsibilities related to digital libraries, application development, collection strategies, resource description services, and special collections and university archives. Dr. Yang’s research interests encompass digital librarianship, digital systems, data governance, and data visualization. He has disseminated his research findings widely in conferences and journals and serves as a peer reviewer and editorial board member for multiple library and information science journals.
Wayne de Fremery is a professor of Information Science and Entrepreneurship and Director of the Francoise O. Lepage Center of Global Innovation at the Dominican University of California. He currently represents the Korean National Body at ISO as Convener of a working group on document description, processing languages, and semantic metadata (ISO/IEC JTC 1/SC 34 WG 9). He is also the owner of Tamal Vista Insights LLC, an independent producer of software that democratizes access to artificial intelligence, as well as Director of the Korea Text Initiative at the Cambridge Institute for the Study of Korea. His work has recently appeared in The Materiality of Reading, Library Hi Tech, A Companion to World Literature, Translation Review, and JASIS&T.
• Adam Becker, University of Oregon
• Yi Bu, Peking University
• Hsuanwei Chen, San José State University
• Ingo Frommholz, University of Wolverhampton
• Souvick Ghosh, San José State University
• Luling Huang, Missouri Western State University
• Tianji Jiang, University of California, Los Angeles
• Haoyong Lan, Carnegie Mellon University
• Yongjia Lei, University of Oregon
• Ying-Hsang Liu, Uppsala University and Chemnitz University of Technology
• Tony Russell-Rose, University of London
• Wenyi Shang, University of Missouri
• Yu Wang, University of Oregon
• Zhiwu Xie, University of California, Riverside
• Haopeng Yuan, University of Oregon
• Zhongda Zhang, University of Oklahoma
[1] Ali, D., Milleville, K., Verstockt, S., Van de Weghe, N., Chambers, S., & Birkholz, J. M. (2023). Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections. Journal of Documentation.
[2] Carter, K. S., Gondek, A., Underwood, W., Randby, T., & Marciano, R. (2022). Using AI and ML to optimize information discovery in under-utilized, Holocaust-related records. AI & SOCIETY, 37(3), 837-858.
[3] Chen, H., Kim, J. A., Chen, J., & Sakata, A. (2024). Demystifying oral history with natural language processing and data analytics: a case study of the Densho digital collection. The Electronic Library, 42(4), 643-663.
[4] Haleem, A., Javaid, M., & Singh, R. P. (2022). An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2(4), 100089.
[5] Hawkins, A. (2022). Archives, linked data and the digital humanities: increasing access to digitised and born-digital archives via the semantic web. Archival Science, 22(3), 319-344.
[6] Hedges, M., Marciano, R., & Goudarouli, E. (2022). Introduction to the special issue on computational archival science. ACM Journal on Computing and Cultural Heritage (JOCCH), 15(1), 1-2.
[7] Jaillant, L., Aske, K., Goudarouli, E., & Kitcher, N. (2022). Introduction: challenges and prospects of born-digital and digitized archives in the digital humanities. Archival Science, 22(3), 285-291.
[8] Jaillant, L., & Caputo, A. (2022). Unlocking digital archives: cross-disciplinary perspectives on AI and born-digital data. AI & SOCIETY, 37(3), 823-835.
[9] Spennemann, D. H. (2023). ChatGPT and the generation of digitally born “knowledge”: How does a generative AI language model interpret cultural heritage values? Knowledge, 3(3), 480-512.
[10] Wang, X., Song, N., Liu, X., & Xu, L. (2021). Data modeling and evaluation of deep semantic annotation for cultural heritage images. Journal of Documentation, 77(4), 906-925.
[11] Wang, X., Zhao, K., Zhang, Q., & Liu, C. (2024). Digital deduction theatre: An experimental methodological framework for the digital intelligence revitalisation of cultural heritage. In Intelligent Computing for Cultural Heritage (pp. 203-220). Routledge.
[12] Zhang, S., Hou, J., Peng, S., Li, Z., Hu, Q., & Wang, P. (2023). ArcGPT: A Large Language Model Tailored for Real-world Archival Applications. arXiv preprint arXiv:2307.14852.