The recent contract signed between the Dutch research institutions and the publisher Elsevier mentions the possibility of an Open Knowledge Base (OKB), but the details are vague. This blog post looks more closely at definitions of an OKB within the context of scholarly communications and at the elements that need to be taken into account in building one.
Readers may also be interested in contributing to the consultation that is being run as part of the Dutch Taskforce on Responsible Management of Research Information and Data. The VSNU will also be commissioning a feasibility study on the topic.
Authors: Alastair Dunning, Maurice Vanderfeesten, Sarah de Rijcke, Magchiel Bijsterbosch, Darco Jansen (all members of above taskforce)
Definition of an Open Knowledge Base
An Open Knowledge Base is a relatively new term, and liable to multiple interpretations. For clarification, we have listed some of the common features of an Open Knowledge Base (OKB):
- it hosts collections of metadata (descriptive data) as opposed to large collections of data (spreadsheets, images etc)
- the metadata is structured as triples of subject, predicate and object (e.g. The Milkmaid (subject) is painted by (predicate) Vermeer (object))
- each point of the triple is usually related to an identifier elsewhere, for example Vermeer in the OKB could be linked with reference to Vermeer in the Getty Art and Architecture thesaurus
- The highly structured nature of the metadata makes it easier for other computers to incorporate that data; OKBs have an important role to play for search engines such as Google as well as a basis for far-reaching analysis
- All the data (whether source or derived) is open for others to access and reuse, whether via an API, a SPARQL endpoint, a data dump, or a simple interface, typically under a CC0 licence
- The data is described according to existing standards, identifiers, ontologies and thesauri
- the rules for who can upload and edit the data will vary between OKBs. All OKBs need to deal with a tension between data extent, richness and quality
- The technical infrastructure is usually hosted in one place – however, the OKB will link to other OKBs to make a larger network of open metadata. In essence, this creates a federated infrastructure
- In some, but not all, cases, the OKB is not an end in itself but supplies the data that other services can build upon; thus there is a deliberate split between the underlying data and the services and tools that use that data
- An OKB shares some aspects with a Knowledge Base of Metadata on Scholarly Communication but is broader both in terms of content and in its commitment to openness
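To make the triple structure in the list above more concrete, here is a minimal Python sketch of an in-memory triple collection with a wildcard query. It is an illustration only, not how any particular OKB is implemented: the `okb:` and `ext:` prefixes, the predicate names and the external authority record are hypothetical placeholders, and a real OKB would use established identifiers and expose the data through something like a SPARQL endpoint.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# All identifiers below are hypothetical placeholders, not real URIs.
TRIPLES = [
    Triple("okb:TheMilkmaid", "okb:paintedBy", "okb:Vermeer"),
    Triple("okb:GirlWithAPearlEarring", "okb:paintedBy", "okb:Vermeer"),
    # Link the local identifier to a record in an external thesaurus.
    Triple("okb:Vermeer", "okb:sameAs", "ext:vermeer-authority-record"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which works were painted by Vermeer?
works = [t.subject for t in match(TRIPLES, predicate="okb:paintedBy", obj="okb:Vermeer")]

# Which external records does the local Vermeer identifier point to?
links = [t.obj for t in match(TRIPLES, subject="okb:Vermeer", predicate="okb:sameAs")]
```

Because the query function is separate from the data, anyone with access to the triples could write a different query or interface on top of them.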
The best current example of an Open Knowledge Base is Wikidata. An example of a service built on top of Wikidata is Histropedia. Library communities around the globe also contribute journal titles to the Global Open Knowledge Base (GOKB).
Open Knowledge Bases and Scholarly Communication
Traditionally, metadata related to scholarly communications has been managed in discrete, unconnected, closed, commercial systems. Such collections of data have been closely tied to the interface to query the data. This restricts the power of the data – whoever creates the interface determines what types of questions can be asked.
An Open Knowledge Base counters this. Firstly, it separates the interface from the data. Secondly, it opens up and connects the underlying metadata to other sources of metadata. Such an approach allows much greater freedom – users are no longer restricted by the specific manner in which the interface was designed nor restricted to querying one set of metadata. Such openness makes the OKB flexible about the type of data it incorporates and when – other data providers with different datasets can connect or incorporate their data at a date that suits them. The openness also allows third parties to build specific interfaces and different services on top of the OKB.
For the field of scholarly communication, an ambitious federated metadata infrastructure would connect all sorts of entities, each with clear identifiers. Researchers, articles, books, datasets, research projects, research grants, organisations, organisational units, citations etc could all form part of a national OKB that connects to other OKBs. It would also help create enriched data, which could then be fed back into the OKB.
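As a rough illustration of how shared identifiers let separate sources be connected, the following Python sketch merges records from two hypothetical suppliers into one view keyed by identifier. The record fields, the DOI-, ORCID- and grant-style identifiers and the `merge_by_identifier` helper are all assumptions made for this example; they do not describe an existing OKB design.

```python
from collections import defaultdict
from typing import Dict, List

# Records from two hypothetical suppliers. All identifiers are placeholders.
institution_records = [
    {
        "id": "doi:10.0000/example.1",
        "type": "article",
        "title": "Example article",
        "author": "orcid:0000-0000-0000-0001",
    },
]
funder_records = [
    {"id": "doi:10.0000/example.1", "grant": "grant:NL-0001"},
    {"id": "doi:10.0000/example.2", "grant": "grant:NL-0002"},
]


def merge_by_identifier(*sources: List[dict]) -> Dict[str, dict]:
    """Combine records that share an "id" into a single record.

    Later sources overwrite earlier ones when both supply the same field;
    a real system would need an explicit rule for such conflicts.
    """
    merged: Dict[str, dict] = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record["id"]].update(record)
    return dict(merged)


okb = merge_by_identifier(institution_records, funder_records)

# The first article now carries both its author and its grant.
enriched = okb["doi:10.0000/example.1"]
```

The sketch only works because both suppliers use the same identifier for the same article, which is why clear identifiers matter so much for a federated infrastructure.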
Such a richness of metadata would be a springboard for an array of services and tools to provide new analyses and insights on the evolution of scholarly communication in the Netherlands.
The best current example of an Open Knowledge Base for scholarly communication is the one developed by OpenAIRE.
TIB Hannover is also developing an Open Research Knowledge Graph. Wikidata also holds plenty of metadata relating to scientific articles. Good examples of enrichment services built on top of Open Knowledge Bases are Scholia, Semantic Scholar and Lens.org. OpenCitations provides both a collection of aggregated data (on scholarly citations) and some basic tools to query it. The Global Open Knowledge Base is another example, with a focus on data needed by libraries to undertake collections management. The study by Ludo Waltman looks at further collections of open metadata.
Issues in constructing an Open Knowledge Base for the Netherlands (OKB-NL)
A well constructed open knowledge base can play a significant role in innovation and efficiency in the scholarly communications ecosystem. Given the breadth of data it can contain, it could be the engine for sophisticated research analysis tools. But it requires significant long-term engagement from multiple stakeholders, who will be both providing and consuming data. It is imperative that such stakeholders work in a collaborative fashion, according to an agreed set of principles.
The Dutch taskforce on Responsible Management of Research Information and Data has opened a consultation on these principles; readers of this blog are invited to contribute until Monday 8th of June 2020.
Whatever principles are used to underlie an OKB, there also needs to be serious thought given to practical concerns. How would an OKB be created and sustained? An OKB is an ambitious project; if it is to succeed it requires strong foundations. The following issues would all need to be addressed:
Who would steer the direction of the OKB? How would any board reflect the multiple research institutions contributing to the OKB? To make an OKB effective, it would require the ongoing participation of every research institution in NL – how would the business model ensure that? And who would actually do the day-to-day management of the OKB? What should be the role of commercial organisations contributing to the OKB and its underlying principles? Should they have a stake in the governance of an OKB?
Who would pay the initial costs for establishing an OKB? How would the ongoing cost be paid? Via institutional membership? Via consortium costs? Via government subsidy? Via public-private partnerships? Would all institutions gain equal benefit from the OKB? Would they pay different rates?
What kind of technical architecture does the OKB require – centralised, with all the data in one place, or distributed, with data residing in multiple locations? If the latter, how can we ensure that the data is open and interoperable? Or some kind of clever hybrid? Given its role as the foundation of other services, how can it be guaranteed that the OKB has as close to 100% uptime as possible? And how can it be as responsive as possible, providing instantaneous responses to user demand?
Scope of Metadata Collection
The potential scope of an OKB is huge. Each content type has its own specific metadata schemes. These schemes evolve over time. How are different metadata types incorporated over time? Article metadata first? Then datasets, code, funding grants, projects, organisations, authors, journals? What about different versions of metadata schemes: do all backlog records need to be converted?
Quality, Provenance and Trust
Would the metadata in the OKB be sufficient to underpin high-quality services? What schema would need to be created for the different sorts of metadata? What critical mass of metadata would be required to create engaging services? What kind of metadata alignment and enrichment would need to be undertaken? Would that be done centrally or by institutions and publishers? What costs would be associated with that? Would the costs be ongoing? Should the provenance of the metadata and of metadata enrichments be attributed to the original supplier?
Service development and Commercial engagement
What incentives would there be for commercial partners to a) provide metadata and b) build services on top of the OKB? Would the investment to develop such services simply lead to one or two big companies dominating the service offer? Would they compete with services not relying on the OKB? What would happen to enriched data created by commercial companies? Would it be returned to the OKB?
Would the resulting services be of use to all contributing members? Could the members develop their own services independent of commercial offerings?
Implementation timeline: Lean or Big Bang
When implementing the OKB, should we first carefully design the full stack of the infrastructure and solve all the questions within the grand information architecture? Or should we let it grow organically, and start by collecting the metadata in the formats that are already legally available according to the publishing contracts? Can we do both in parallel: start collecting and start designing?
As mentioned above, the VSNU will be commissioning a feasibility study of an Open Knowledge Base. In the meantime, Maurice Vanderfeesten has written a further blog on Solutions for constructing an Open Knowledge Base for the Netherlands (OKB-NL).