Solutions for constructing an Open Knowledge Base for the Netherlands (OKB-NL)

This blog post follows on from the earlier blog post, "What is an Open Knowledge Base anyway?". It is written by Maurice Vandenfeesten (VU, Amsterdam).

To give a hint of how the OKB could be realised, we first need to introduce two other concepts. One is the star rating system for data, and the other is building the OKB in two different phases.

Linked Open Data Star Rating

https://5stardata.info/

This is a concept introduced by Sir Tim Berners-Lee: rather than a web that only presents pages to be read by humans, the web should also present data that can be read and interpreted by machines, directly and interoperably, using a unified, agreed standard: the Resource Description Framework, or RDF. Publishing your data on the web as RDF, identified by URIs and linked to other data, gets you five stars. The vision of the OKB is to have all the metadata available as 5-star linked open data. This, however, is not the current reality. The data made available by publishers and universities and put in the OKB are 3-star data at best:

  1. Made available on the web (e.g. in a data repository).
  2. In a structured manner (e.g. as a table or nested structure).
  3. In a non-proprietary format (e.g. CSV, JSON, XML).
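To make the gap between 3-star and 5-star data a bit more tangible, here is a minimal sketch that takes one CSV row and rewrites it as RDF triples in N-Triples syntax. The DOI, the column names and the helper functions are made up for illustration; the ORCID iD is the well-known fictional test record, and the two predicates come from the Dublin Core Terms vocabulary. It is an illustration of the idea, not a proposal for the OKB data model.

```python
import csv
import io

# A 3-star record: open, structured and non-proprietary (CSV), but the
# column names carry no machine-readable meaning. All values are made up.
CSV_TEXT = """doi,title,author_orcid
10.1234/example.1,An example article,0000-0002-1825-0097
"""

DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row: dict) -> list[str]:
    """Turn one CSV row into N-Triples lines in which things are URIs."""
    article = f"<https://doi.org/{row['doi']}>"
    author = f"<https://orcid.org/{row['author_orcid']}>"
    return [
        f'{article} <{DCTERMS}title> "{escape_literal(row["title"])}" .',
        f"{article} <{DCTERMS}creator> {author} .",
    ]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
triples = row_to_ntriples(rows[0])
```

The second triple is the one that matters for the fifth star: the author is no longer a string in a column but a link to an identifier that somebody else maintains.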

This brings us to the next concept.

IST and SOLL: OKB in different phases and different speeds

We start right away, building an OKB with what we have right now: mature technology and robust services in Phase 1. In Phase 2 we start building the OKB we envision.

The following dives into the details and makes things much more concrete, to make tangible what a Phase 1 OKB can actually be.

IST: Start small and lean – What can we do in the next couple of years?

To make an initial start that is feasible and to work on pilots, we need to work with the data and the data formats that systems can already deliver.

Next follows our train of thought on what the Phase 1 OKB should look like, but we would love to hear yours in the comments below.

OKB; data repository for 3-star data

In this initial phase we appoint a data repository as the initial location where metadata providers periodically deliver their metadata files under a CC0 license, including information on the standard the delivered files follow (how to interpret their syntax and semantics). This can be, for example, the 4TU.Datacentre or Dataverse.nl, where OKB deposits can be made into a separate collection/dataverse.

Services; working with 3-star data

The datasets are available to the service providers. They need to download the files and process them into their own data structure. Here, at the service level, the interlinks between the different information entities come into existence and can be used for the purpose of the service.
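As a rough sketch of what "interlinks coming into existence at the service level" could mean in practice: a service loads records from two providers and joins them on a shared identifier. The record layout, the field names (`grant_ids`, `grant_id`, `project`) and the values are all hypothetical; real deliveries would first have to be parsed from their own formats.

```python
from collections import defaultdict

# Hypothetical, already-parsed records from two different providers.
publications = [
    {"doi": "10.1234/example.1", "title": "An example article", "grant_ids": ["GRANT-001"]},
    {"doi": "10.1234/example.2", "title": "Another article", "grant_ids": []},
]
grants = [
    {"grant_id": "GRANT-001", "funder": "Example Funder", "project": "Example project"},
]


def link_publications_to_grants(publications, grants):
    """Map each DOI to the projects whose grant identifier it mentions."""
    grants_by_id = {grant["grant_id"]: grant for grant in grants}
    links = defaultdict(list)
    for publication in publications:
        for grant_id in publication["grant_ids"]:
            grant = grants_by_id.get(grant_id)
            if grant is not None:  # skip identifiers no funder delivered
                links[publication["doi"]].append(grant["project"])
    return dict(links)


links = link_publications_to_grants(publications, grants)
```

In Phase 1 every service has to build such a join for itself, which is exactly the duplicated effort the linked data approach of Phase 2 is meant to remove.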

Metadata providers; delivering 3-star data

In our case we have different kinds of metadata providers. To name a few: third parties, universities and funders. The third parties can be publishers, indexes or altmetrics providers. Each of them can deliver different information entities in the scholarly workflow, and can deliver files in a different format, in an open standard with a CC0 license.

Delivery format for Publishers:
  • Article level data:
    • Machine-readable article format:
      • Standard: JATS-XML (Journal Article Tag Suite)
      • Contains:
        • Header: metadata including title, abstract, keywords, authors, affiliations, author roles
        • Footer: including reference lists
        • Body: including sections with paragraphs, tables, figures
      • Scope: articles with Dutch authorship
    • Article level counter statistics:
      • Standard: Article Level COUNTER-XML
      • Contains: categorised statistics (view/download/deny/etc.) on different parts of the article
      • Scope: worldwide usage of articles with Dutch authorship
Delivery format for Index providers (Scopus, WoS, Dimensions, etc.):
  • Standard: the full record, including (when applicable) keywords, abstracts, reference lists, etc., in JSON
  • Contains: Articles, Grants, Patents, Clinical Guidelines, etc.
  • Scope: entities with Dutch contributorship
Delivery format for Altmetrics providers (Altmetric, PlumX, etc.):
  • Mentions of all types (News, Policy Docs, Trials, social media, etc.) in JSON
  • Scope: all mentions of Dutch research output (publications/datasets/etc.)
Delivery format for Universities (CRIS) and Funders: 
  • Standard: CERIF-XML OpenAIRE-GL 
  • Contains: information about Organisational Units, Researchers, Projects, Publications, Datasets, Awarded Grants, etc.
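To give an impression of what a service would do with a delivered article file, the sketch below reads the title, the authors and the number of references from a tiny JATS-like document using Python's standard library. The sample document is made up and far smaller than a real article; the element names (`front`, `article-meta`, `contrib`, `back`, `ref-list`) follow the JATS tag set, but a real file would need more careful handling (namespaces, multiple contributor groups, missing elements).

```python
import xml.etree.ElementTree as ET

# A made-up, heavily reduced JATS-style article: header (front),
# body, and footer (back) with a reference list.
JATS_SAMPLE = """
<article>
  <front>
    <article-meta>
      <title-group><article-title>An example article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Jansen</surname><given-names>Anna</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body><sec><p>Some text.</p></sec></body>
  <back><ref-list><ref id="r1"/></ref-list></back>
</article>
""".strip()

root = ET.fromstring(JATS_SAMPLE)

# Header: title and author names.
title = root.findtext("front/article-meta/title-group/article-title")
authors = [
    f"{contrib.findtext('name/given-names')} {contrib.findtext('name/surname')}"
    for contrib in root.findall(
        "front/article-meta/contrib-group/contrib[@contrib-type='author']"
    )
]

# Footer: count the entries in the reference list.
reference_count = len(root.findall("back/ref-list/ref"))
```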

All information entities need to be delivered as individual files in a zipped package. That package must be logically aggregated and deposited, e.g. by year or month. Provenance metadata about the source providing the data and an open licence need to be added. The deposit also needs descriptive metadata, including pointers to the open standard of the data files, to adhere to the FAIR principles. https://www.go-fair.org/fair-principles/
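A deposit along these lines could be assembled as sketched below: individual files go into one zip archive together with a small manifest that carries the provenance, the licence and a pointer to the standard. The manifest file name and its field names (`provider`, `period`, `license`, `standard`) are our own assumptions for illustration, not an agreed OKB format, and the file contents are placeholders.

```python
import io
import json
import zipfile


def build_package(files: dict[str, bytes], manifest: dict) -> bytes:
    """Zip individual entity files together with a manifest.json."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in sorted(files.items()):
            archive.writestr(name, content)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


# Hypothetical deposit: one month of article files from one publisher.
manifest = {
    "provider": "Example Publisher",  # provenance: who delivered the data
    "period": "2020-01",              # logical aggregation by month
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "standard": "https://jats.nlm.nih.gov/",  # how to interpret the files
}
files = {
    "articles/10.1234_example.1.xml": b"<article/>",
    "articles/10.1234_example.2.xml": b"<article/>",
}
package = build_package(files, manifest)
```

The resulting bytes can be written to disk and uploaded to the repository; the descriptive metadata of the deposit itself would still be entered in the repository's own form.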

Service providers can then download the data from the OKB and use it, for example, to fill a search index. This can in turn be used, for example, to enrich the metadata of the Dutch CRIS systems.
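To illustrate the search index idea in its simplest form, here is a toy inverted index over article titles. The records are invented and the tokenisation is deliberately naive; a real service would more likely load the data into a dedicated search engine.

```python
import re
from collections import defaultdict

# Invented records, as a service might hold them after processing a deposit.
records = [
    {"doi": "10.1234/example.1", "title": "Open knowledge bases in practice"},
    {"doi": "10.1234/example.2", "title": "Knowledge graphs for research"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map every title token to the set of DOIs whose title contains it."""
    index = defaultdict(set)
    for record in records:
        for token in tokenize(record["title"]):
            index[token].add(record["doi"])
    return index


def search(index, query):
    """Return the DOIs whose title contains all tokens of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


index = build_index(records)
hits = search(index, "open knowledge")
```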

SOLL: the open knowledge base of the future

To stay true to the 5-star Linked Open Data mindset, this OKB is an interconnected, distributed network of data entities, where access and connectivity are maintained by each owner of the data nodes. Those node owners can be the publishers, funders and universities. They can independently make claims, or assertions, about the identifiers of the content types they maintain.

For example, publishers maintain identifiers for publications, universities for affiliations of researchers, ORCID for researchers, funders for funded projects, etc. The interconnectivity comes from two things. Firstly, node owners can make claims about their content types in relation to the content types of other node owners. For example, a publisher can make the assertion that this publication was made with funds from that funded project, independently of the funder itself. This stays true to the old days of the internet, where everyone could make their own web page and link to others without bi-directional approval.

Secondly, assertions are made using entities and relationships defined by the linked open data cloud. This assures interoperability: “machines” understand, on a semantic level, concepts they can use in their internal processes. For example, a data field called name can be interpreted by one machine as the first name of a person, while another machine interprets it as an organisation name or the name of an instrument. Using the ontologies of the linked open data cloud pins the exact semantic meaning to the field name.
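The ambiguity of a field called name can be shown in a few lines. In the sketch below the same kind of value is attached to a subject with a different, fully qualified predicate depending on what it means. The predicates are existing terms from FOAF and schema.org; the subject URIs under example.org, the `kind` labels and the `pin_name` helper are hypothetical, and an instrument would need a term from a vocabulary that covers instruments.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
SCHEMA = "https://schema.org/"

# Which vocabulary term pins down what "name" means for each kind of thing.
NAME_PREDICATE_BY_KIND = {
    "person_given_name": FOAF + "givenName",
    "organisation": SCHEMA + "legalName",
}


def pin_name(subject_uri, kind, value):
    """Return a (subject, predicate, object) triple with an unambiguous predicate."""
    return (subject_uri, NAME_PREDICATE_BY_KIND[kind], value)


person_triple = pin_name("https://example.org/person/1", "person_given_name", "Anna")
organisation_triple = pin_name(
    "https://example.org/org/1", "organisation", "Example University"
)
```

A machine that receives these triples no longer has to guess: the predicate itself says whether the string is a given name or the legal name of an organisation.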

To keep track of who made which assertion, provenance information is added. This way services are able to weigh assertions from one node owner differently from those of another. (More about that in Nano Publications: www.nanopub.org)
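A minimal sketch of how a service could use that provenance: every assertion records who made it, and the service applies its own trust weights when the same claim, or a conflicting one, comes from several node owners. The `fundedBy` predicate, the URIs and the weights are invented for illustration; this is not the nanopublication format itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    asserted_by: str  # provenance: the kind of node owner making the claim


FUNDED_BY = "https://example.org/vocab/fundedBy"  # hypothetical predicate
PUBLICATION = "https://doi.org/10.1234/example.1"

assertions = [
    Assertion(PUBLICATION, FUNDED_BY, "https://example.org/project/A", "publisher"),
    Assertion(PUBLICATION, FUNDED_BY, "https://example.org/project/A", "funder"),
    Assertion(PUBLICATION, FUNDED_BY, "https://example.org/project/B", "publisher"),
]

# Each service decides for itself how much it trusts each node owner.
TRUST = {"funder": 1.0, "publisher": 0.5}


def score_claims(assertions, trust):
    """Sum the trust weights of everyone who asserted the same triple."""
    scores = {}
    for assertion in assertions:
        claim = (assertion.subject, assertion.predicate, assertion.obj)
        scores[claim] = scores.get(claim, 0.0) + trust.get(assertion.asserted_by, 0.0)
    return scores


scores = score_claims(assertions, TRUST)
best_claim = max(scores, key=scores.get)
```

Here the link to project A wins because both the publisher and the funder assert it, while another service with different weights could reach a different conclusion from the same assertions.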

Zooming out, we see the OKB, connected with the linked data cloud, as a “knowledge representation graph” that has numerous applications in answering the most complex research questions.
