The project aimed to support small and medium publishing enterprises in embracing digital transformation
in terms of content indexing and discoverability. Concretely, the project goal was to start democratizing
access to knowledge graphs – the technology underpinning modern scientific search engines – for small and
medium publishers in the arts, humanities, and social sciences. We planned to develop a framework
for extracting structured information from scholarly publications, such as references and indices, and for
creating knowledge graphs from it. The project was conducted in collaboration with the publisher Brill, which
provided a case study with its Classics collection of publications.
The goal of the project has been largely achieved. We designed and implemented a workflow that takes
raw publication data as input, detects structured contents (reference lists and indices), and extracts and
disambiguates them into a knowledge graph. The workflow makes use of mainstream indexing services, such
as Crossref, and data models, such as the OpenCitations data model. All steps of the pipeline have been
tested in collaboration with Brill, resulting in the creation of the Brill’s Classics knowledge graph. The project
results are being documented in two scientific publications, while both the codebase and the Brill’s Classics
knowledge graph are to be openly released soon after the end of the project.
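The workflow steps summarized above (detect a reference list, extract its entries, prepare them for disambiguation, emit graph statements) can be illustrated with a minimal Python sketch. This is not the project's codebase: the function names, the heading heuristic, the year pattern, and the predicate labels (`cites`, `text`, `year`) are all hypothetical placeholders, and the predicates are not terms from the OpenCitations data model. The only external detail assumed is the public Crossref REST API `works` endpoint with its `query.bibliographic` parameter; the sketch only builds the query URL and does not send it.

```python
import re
from urllib.parse import urlencode

# Hypothetical heuristics: a heading line that opens a reference list,
# and a four-digit publication year between 1500 and 2099.
HEADING = re.compile(r"^\s*(references|bibliography|works cited)\s*$", re.IGNORECASE)
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def detect_reference_list(lines):
    """Return the non-empty lines that follow the first reference-list heading."""
    for i, line in enumerate(lines):
        if HEADING.match(line):
            return [entry.strip() for entry in lines[i + 1:] if entry.strip()]
    return []


def crossref_query_url(reference):
    """Build (but do not send) a Crossref bibliographic query for one reference string."""
    params = {"query.bibliographic": reference, "rows": 1}
    return "https://api.crossref.org/works?" + urlencode(params)


def to_triples(citing_id, references):
    """Turn raw reference strings into (subject, predicate, object) statements.

    The predicates are illustrative placeholders, not a real vocabulary.
    """
    triples = []
    for n, ref in enumerate(references, start=1):
        ref_id = f"{citing_id}/ref/{n}"
        triples.append((citing_id, "cites", ref_id))
        triples.append((ref_id, "text", ref))
        match = YEAR.search(ref)
        if match:
            triples.append((ref_id, "year", match.group(1)))
    return triples
```

In a fuller pipeline, the URL from `crossref_query_url` would be fetched and the best-matching record used to replace the placeholder reference identifier with a persistent one such as a DOI; that step is omitted here because it depends on matching thresholds the summary does not specify.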
The scientific publishing industry is rapidly transitioning towards information analytics. This shift disproportionately benefits large companies, which can afford to deploy digital technologies such as knowledge graphs to index their contents and build advanced search engines. Small and medium publishing enterprises, by contrast, often lack the resources to fully embrace such digital transformations. This divide is acutely felt in the arts, humanities and social sciences. Scholars from these disciplines are largely unable to benefit from modern scientific search engines, because their publishing ecosystem is made up of many specialized businesses which cannot, individually, develop comparable services.
We propose to start bridging this gap by democratizing access to knowledge graphs – the technology underpinning modern scientific search engines – for small and medium publishers in the arts, humanities and social sciences. Their contents, largely made up of books, already contain rich, structured information – such as references and indices – which can be automatically mined and interlinked. We plan to develop a framework for extracting this structured information and creating knowledge graphs from it. As far as possible, we will consolidate existing, proven technologies into a single codebase instead of reinventing the wheel.
Our consortium brings together researchers in scientific information mining; Odoma, an AI consulting company; and the publisher Brill, which shares its data and expertise. Brill will be able to put the project results to use immediately to improve its internal processes and services. Furthermore, our results will be released as open source under a commercially friendly license, in order to foster the adoption and future development of the framework by other publishers. Ultimately, our proposal is an example of industry innovation where, instead of scaling up, we scale wide by creating a common resource which many small players can then use and expand upon.