Antoni van Leeuwenhoek is considered the father of microbiology. His discovery of microbes, using the lenses he made himself, created an entirely new field of research. He was at the same time a researcher and a tool maker: his research would not have been possible without the tools he built. Leeuwenhoek was well known both for his discoveries in microbiology and for the unmatched quality of his lenses.
Four centuries after Leeuwenhoek, research tools have only gained importance. Recently, a new type of tool gained critical importance: research software. The 2020 COVID-19 pandemic has brought the importance of research software to the public eye.
However, research software does not receive the recognition it deserves. A group of members of the NL-RSE network, together with software-minded data specialists, got together in an attempt to raise the profile of research software. Our position paper provides further details.
Back in March 2019, we had a meeting with NWO about the role of software in research. Following that meeting, we wrote a position paper with recommendations for funding agencies and research institutions to raise the profile of research software. In August 2019 we made it publicly available for comments from the RSE community. In November 2019, we also had a feedback session during the NL-RSE conference. The author group got together again in January 2020 to integrate the community feedback. After a long revision process, the “final” version is ready. This paper focuses on the Netherlands, but the issues and recommendations could be adapted and adopted by other countries.
These recommendations have already been broadly commented on; however, if you would like to comment on them, feel free to reach out to any of the authors or contact us via the NL-RSE network.
4TU.ResearchData is an international repository for research data in science, engineering and design. After over 10 years of using Fedora, an open source repository system, to run 4TU.ResearchData, we have decided to migrate a significant part of our technical infrastructure to a commercial solution offered by figshare. Why did we decide to do it? And why now, at a time of increasing concerns about relying on proprietary solutions, particularly those associated with large publishing houses, to run scholarly communication infrastructures? (See, for example, In pursuit of open science, open access is not enough and the SPARC Landscape Analysis.)
We anticipate that members of our community, as well as colleagues who use or manage scholarly communications infrastructures, might be wondering the same. We are therefore explaining our thinking in this blog post, hoping it will facilitate more discussion about such developments in the scholarly communications infrastructure.
Why not continue with Fedora?
So, first, why not continue with Fedora? Any software, but open source software in particular, needs to be maintained. It’s a tough process. Maintenance means developers, who are often difficult to retain within academic environments (industry offers more competitive salaries); other developers, approaching retirement, proved irreplaceable. We also faced the challenge of migrating to the next version of Fedora – a significant challenge simply to keep the repository running.
At the same time, researchers started requesting additional functionality: better statistics, restricted access to confidential datasets, integration with GitHub, among many others. With insufficient development capacity it proved increasingly challenging to keep up with these demands. Undertaking a public tender for a managed repository platform, where development efforts could be outsourced to the entity providing the repository solution, looked like the best way to deal with these twin challenges.
Why not Zenodo or Dryad? Or another open source repository?
Open Source advocates may ask why we did not try open source repository solutions. We tried hard. We were in discussion with Zenodo (who are working on the out-of-the-box Invenio repository solution), but the product was still at the pilot stage when we had to start our tender. We had discussions with Dryad, but Dryad’s offering at the time did not give us the functionality we required. Another university running an open source repository platform contacted us, but they withdrew in the end – the tender process required too much bureaucracy.
We received no interest from other open source repository tool providers, despite utilising several data management and repository networks to share information about the tender and solicit broader participation.
The next step was to start the public tender process. Within the EU, this is a compulsory step for transparency and accountability purposes at any public institution making purchases over a certain threshold. The tender process is an exhausting hurdle. But it does offer the opportunity to describe exactly the services and guarantees which are essential. This is very useful for building some security against vendor lock-in.
We had already made the decision to retain servers at TU Delft for data storage. Additional requirements within the tender included the guarantee that all metadata would be CC0; that import and export formats (e.g. JSON, XML) and protocols (a good API) would be available and well documented; and that an escape strategy would be supplied by the winning bidder, demonstrating the measures that would be enacted for DOIs, data, user information, metadata and webpages should either party wish to leave the contract. The winning bidder also offered to make its code open source should it ever cease development. Such arrangements provide some flexibility for us; if conditions change in the future we are in a position to change our technical infrastructure.
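As a rough illustration of how such an escape strategy can be made testable, the sketch below checks exported metadata records for the fields needed to rebuild a collection on another platform. The record structure and field names here are assumptions invented for the example, not figshare’s actual export schema:

```python
import json

# Hypothetical example: one metadata record as it might be exported from a
# repository API (the field names are illustrative, not figshare's actual
# export schema).
exported_record = json.loads("""
{
  "doi": "10.4121/uuid:example-1234",
  "title": "Example dataset",
  "license": "CC0",
  "files": [{"name": "data.csv", "download_url": "https://example.org/data.csv"}]
}
""")

# Fields an escape strategy might require every record to carry, so the
# collection could be rebuilt elsewhere.
REQUIRED_FIELDS = {"doi", "title", "license", "files"}

def check_exportable(record: dict) -> list:
    """Return the sorted list of required fields missing from a record."""
    return sorted(REQUIRED_FIELDS - record.keys())

print(check_exportable(exported_record))  # an empty list means the record passes
```

Running such a check over a full metadata dump before and during a contract is one way to verify that the exit route promised in a tender actually remains usable.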
There were two main bidders in the tender process; both coming from the commercial sector. Figshare turned out to have the better and more mature solution, and was the winner of the tender process.
Collaboration with figshare
We have now started working together with figshare and are just about to complete the migration process from the old Fedora repository to figshare.
What we have already noticed is the professionalism and responsiveness of figshare colleagues. They have a team of developers devoted to the product. We are pleased with figshare’s integration capability – data in and out via APIs, which enables opportunities to connect our repository with other tools and products used by our research community.
We are also pleased to see that figshare are interested in receiving and considering user feedback. They are now in the process of rethinking the whole metadata structure offered by the platform as a result of user feedback and are now considering potential future support for RDF format datasets. Such a move could enable greater interoperability of data.
But. Figshare is not an open source tool and it is one of the products offered by a technology company called Digital Science. Digital Science is part of Holtzbrinck Publishing group, which also owns the publishing giant Springer Nature. As mentioned before, there are concerns within the community about publishers getting a strong grip on scholarly communication infrastructure and data infrastructure in particular.
Short-term (our contract with figshare is for three years), figshare promises to deliver functionalities for which our end user communities have been waiting for a long time. We are pleased to be working with them.
But long-term we are still interested in the alternatives. There are a number of initiatives, for example, Invest in Open Infrastructure, which aim to develop collaborative, viable and sustainable open source infrastructures for scholarly communications. These are crucial strategic investments for research institutions and time, money and expertise should be devoted to such activities.
The broader research community is still in need of open source alternatives, which can be developed and sustained in a collaborative manner.
However, someone needs to develop, organise and sustain long-term maintenance of such open source alternatives. Who will that be? There are many organisations facing this challenge.
So we should really invest in our capacity to collaborate on open source projects. Only then will we be able to co-develop much needed open source alternatives to proprietary products.
Short-term savings and long-term strategic plans are different matters and both require careful planning.
Finally, we also wanted to share some lessons learnt.
More transparency and more evidence needed
Our first lesson learnt is that more transparency and more evidence are needed when comparing the costs of running open source versus commercial infrastructures. Many say that commercial, managed infrastructures are cheaper. However, implementing such infrastructures does not happen at no cost. The efforts involved in migration, customisation, communication etc. are not negligible and apply to both open source software and commercial platforms. One recent publication suggests that the effort needed to sustain one’s own open source infrastructure is comparable to that involved in implementing a third-party solution in an institutional setting.
We need more evidence-based comparisons of running such infrastructures in scholarly communications settings.
Easy to criticise. Easy to demand. But we need working, sustainable solutions.
Finally, we have received some criticism over our decision to migrate to figshare, in particular from Open Science advocates.
While we acutely appreciate, understand and wholeheartedly support the strategic preference for Open Source infrastructures at academic institutions and in information management in particular, viable alternatives to commercial products are not always available in the short term.
We need to talk more and share much needed evidence and experience. We also need to invest in a skilled workforce and join forces to work together on developing viable solutions for open source infrastructures for scholarly communications, which hopefully will be coordinated by umbrella organisations such as Invest in Open Infrastructure.
Running tender processes makes different workforce demands
While outsourcing solves the problem of lack of developers, running an EU tender process creates other challenges. Tender processes are slow, cumbersome and require dedicated legal and procurement support. Discussions are no longer with in-house developers but with legal advisers. The procurement process requires numerous long documents, a forensic eye for detail, and an ability to explain and justify even the simplest functional demands. To ensure an equal and fair process, everything needs to be quantified. For example, one cannot simply require that an interface shows ‘good usability’ – the tender documents need to define good usability and indicate how it will be judged in the marking process.
If others are undertaking the same process, they may wish to consult the published version of the tender document.
We hope that the published tender document, as well as this blog post, might initiate greater discussion within the community about infrastructures for scholarly communication and encourage more sharing of evidence and experience.
Our intention behind the original post was to explain the processes behind our decision, as honestly as we possibly could. However, some of the comments we received made us realise that we unfairly portrayed our colleagues from the Invenio and Dryad teams, as well as other colleagues supporting open source infrastructures. This is explained in the blog post “Sustainable, Open Source Alternatives Exist”, which was published as a reaction to our post. We apologise for this.
We did not mean to imply in our post that sustainable open source alternatives do not exist. That is not what we think or believe. We also did not mean to imply that open source and hosted are mutually exclusive.
We wholeheartedly agree with the remark that tenders are bureaucratic hurdles. However, tender processes are often favoured by big public institutions. The fact that open source infrastructure providers are often unable to compete successfully in such processes is an issue.
In the future, we would like to be involved in discussions about making tender processes accessible and fair to open source providers, or how to make alternatives to tender processes acceptable at large public institutions.
The first Remote ReproHack was held on the 14th of May 2020. About 30 participants joined the online party with the mission to learn more about reproducibility and to reproduce some papers! A ReproHack is a one-day event where participants aim to reproduce papers of their choice from a list of proposed papers whose authors have indicated that they would like to receive feedback. The ReproHack aims to offer a safe space for constructive feedback, so that it is a valuable learning experience for both the participants and the authors.
Recent studies and surveys have indicated that scientific papers often cannot be reproduced because the supporting data and code are inaccessible or incorrect (see for example the Nature survey results here). In computational research, only 26% of papers are reproducible (Stodden 2018). To learn more about how these numbers can be improved, I joined the first ReproHack in the Netherlands last year. During this ReproHack I managed to reproduce the figures from a physics paper on Majorana bound states by André Melo and colleagues. I must admit that most of the work was done by Sander, who was very patient with my beginner Python skills. This year, I was set on trying to reproduce a paper that made use of R, a language that I have learned to appreciate more since attending the Repro2020 course earlier this year.
The Remote ReproHack started with welcoming the participants, who signed in on an online text document (HackMD) where we could list our names, affiliations and Twitter/GitHub information. This way we could learn more about the other participants. The check-in document also provided us with the schedule of the day, the list of research papers from which we could choose to reproduce, and the excellent code of conduct. After this digital check-in and words of welcome, Daniel Nüst gave a talk about his work on improving the reproducibility of software and code. Next, Anna Krystalli, one of the organisers, took us through the process of how to reproduce and review the papers during the ReproHacking breakout sessions. During these breakout sessions the participants were split into smaller groups to work on the papers that they had selected to reproduce. It was also possible to try to reproduce a paper by yourself.
10:00 – Welcome and Intro to Blackboard Collaborate
10:10 – Ice breaker session in groups
10:20 – TALK: Daniel Nüst – Research compendia enable code review during peer review (slides)
10:40 – TALK: Anna Krystalli – Tips and Tricks for Reproducing and Reviewing (slides)
11:00 – Select Papers
11:15 – Round I of ReproHacking (break-out rooms)
12:15 – Re-group and sharing of experiences
12:30 – LUNCH
13:30 – TALK: Daniel Piqué – How I discovered a missing data point in a paper with 8000+ citations
13:45 – Round II of ReproHacking (break-out rooms)
14:45 – COFFEE
15:00 – Round III of ReproHacking (break-out rooms) – Complete Feedback form
16:00 – Re-group and sharing of experiences
16:30 – TALK: Sarah Gibson – Sharing Reproducible Computational Environments with Binder (slides) (see also here for materials from a Binder Workshop)
16:45 – Feedback and Closing
The participants had ~15 minutes to decide which paper to reproduce from a list that contained almost 50 papers! The group that I joined was going to reproduce the preprint by Eiko Fried et al. on mental health and social contact during the COVID19 pandemic. Our group consisted of Linda Nab, one of the organisers of the ReproHack, Alessandro Gasparini (check out his work on INTEREST here if you work with simulations!), Anna Lohmann, Ciu and myself. The first session was spent finding out how we could download all the data and code from the Open Science Framework. After we had retrieved all the files, we had to download packages (or update R). During the second session we were able to do more reproducing rather than just getting set up. The work by Eiko Fried was well structured and documented, so after the initial problems with getting everything set up, the process of reproducing the work went quite smoothly. In the end, we managed to reproduce the majority of the paper!
In the third session, feedback was provided to the authors of the papers that were being reproduced, using the feedback form that the ReproHack team set up. This form contained questions about which paper was chosen, whether the participants were able to reproduce the paper, and how much of the paper was reproduced. In more detail, we could describe which procedure/tools/operating system/software we used to reproduce the paper and how familiar we were with these. We also had to rate the reusability of the material, and indicate whether the material had a licence. A very important section of the feedback form asked which challenges we ran into while trying to reproduce the paper, and what the positive features were. A separate section was dedicated to the documentation of the data and code, asking how well the material was documented. Additional suggestions and comments to improve the reproducibility were also welcomed.
After everyone returned from the last breakout sessions and filled in their feedback forms, the groups took turns to discuss whether they were able to reproduce the papers that they had chosen and, if not, which challenges they faced. Most of the selected papers were reproduced by the participants. It was noted that proper documentation – such as readme files, manuals and comments in the scripts themselves explaining the correct operating instructions – was especially helpful in reproducing someone else’s work.
Another way of improving the quality and reproducibility of research is by asking your colleagues to reproduce your findings and offering them a co-author position (see this paper by Reimer et al. (2019) for more details on the ‘co-pilot system’). Some universities have dedicated services for checking the code and data before they are published (see this service at Cornell University).
There are several tools available to check and clean your data:
If you would like to learn more about ReproHacks, the Dutch ReproHack team wrote a paper on the Dutch ReproHack in November 2019. If you would like to participate, organise your own ReproHack, or contribute to the ReproHack work, the ReproHack team invites contributions on GitHub.
Anna Krystalli provided the Remote Reprohack participants with some additional resources to improve the reproducibility of our own papers:
The Turing Way: a lightly opinionated guide to reproducible data science.
Contributors in order of chronological contribution: Esther Plomp, Paula Martinez Lavanchy, Marta Teperek, Santosh Ilamparuthi, and Yasemin Turkyilmaz – van der Velden.
FAIRsharing organised a workshop for the Data Stewards and Champions at TU Delft on the afternoons of the 11th and 12th of June. We were joined by colleagues from University of Stuttgart, RWTH Aachen University, Technical University of Denmark (DTU), and the Swiss Federal Institute of Technology Lausanne (EPFL).
FAIRsharing is a cross-disciplinary platform that houses manually curated metadata on standards, databases and data policies. FAIRsharing works together with a large community that can add their metadata standards, policies and databases to the platform. You can view the introduction presentation here (see here for the slides).
During the first day of the workshop, which was led by Peter McQuilton, there was a demonstration of how to search FAIRsharing and how to apply the standards therein. The curation activities around the standards and databases in FAIRsharing were also explained in detail. On the second day, the participants discussed how to develop standards when there are no community-endorsed standards available, and also how to contribute a standard to FAIRsharing. You can view a recording of the second day here (slides available here).
Day 1: FAIR and FAIRsharing
For anyone that has never heard of the FAIR principles (Findable, Accessible, Interoperable and Reusable), a short explanation is outlined below:
For your information/data to be findable, it needs to be discoverable on the web
It needs to be accompanied by a unique persistent identifier (e.g., DOI)
For your information/data to be accessible, it needs to be clearly defined how this would be possible, and appropriate security protocols need to be in place (especially important for sensitive data which contains personal information)
For your information/data to be interoperable, it needs to be machine-actionable; it needs to be structured in a way that not only humans can interact with it, but also software/machines
Your data can be more easily integrated with data of other researchers when you use community adopted standards (formats and guidelines such as a report or publication)
You should link your information/data to other relevant resources
For your information/data to be reusable, it needs to be clearly licensed, well documented and the provenance needs to be clear (for example, found in a community repository)
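The FAIR points above can be made concrete with a small sketch of a machine-actionable metadata record. The field names and the DOI below are illustrative assumptions for the example rather than any particular metadata standard:

```python
import json

# Illustrative sketch of a machine-actionable metadata record; field names
# are assumptions for this example, not a real schema.
record = {
    "identifier": "https://doi.org/10.5281/zenodo.0000000",  # findable: persistent identifier
    "access": "open",                                        # accessible: access conditions stated
    "format": "text/csv",                                    # interoperable: standard format
    "license": "CC-BY-4.0",                                  # reusable: clear licence
    "relations": [                                           # linked to other relevant resources
        {"type": "isSupplementTo", "target": "https://doi.org/10.0000/example"}
    ],
}

# Because the record is structured data rather than free text, software can
# act on it directly, e.g. check the licence without human intervention.
print(json.dumps(record, indent=2))
```

The point of the structure is that a harvester or search engine can answer questions like “is this dataset openly licensed?” mechanically, which free-text descriptions do not allow.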
During our workshop, the FAIRsharing team highlighted that in order to make data truly FAIR, we need to have data standards! FAIRsharing helps people find disciplinary standards and provides support on the application of standards. Delphine Dauga highlighted that it is important for communities to share vocabularies in order to effectively communicate with each other, as well as with machines. You can view the recording of her talk on the curation process of standards on FAIRsharing.org on YouTube.
Day 2: How to contribute to, or develop, a standard?
To start off this day, a definition of “a standard” was given by Susanna Sansone. A standard is an agreed-upon convention for doing ‘something’, established by community consensus or an authority. For example, nuts and bolts are currently following international standards that outline their size, but this was not always the case (see below)!
When you cannot find an applicable standard and you’re ready to work on a new standard, you should set up a community of governance for the standard. This means that a group should be established with individuals that have specific roles and tasks to work on the standard. Groups that are developing standards should have a code of conduct to operate successfully (for example, see The Turing Way code of conduct). There are different directions the group can take: one is to work under established or formal organisations which produce standards that might be adopted by industry (think of the standards that govern the specifications of a USB drive); the other is to form grass-roots, bottom-up communities. There are advantages and limitations to both. The formal organisations already have development processes in place, which may not be flexible but can engender greater trust among end users. The grass-roots groups, while not having an established community to begin with, provide greater flexibility and are often the route taken when developing research-level standards.
Development of a standard requires time and commitment
The standard needs to be tested and open to feedback, possibly multiple times over a long time period. The group needs to generate a web presence and share the different versions of the standard, ideally in a place that people can contribute to these versions (e.g., GitHub). It is desirable to use multiple communication channels to facilitate broad and inclusive contributions. These contributions do not stop when the standard is developed, but will need to be maintained and new requests for changes and contributions will have to be implemented. To maintain momentum, one should set clear timelines and ensure that there are moments where more intensive discussions can take place. This governance group needs to be sustainable. Sustainability can be ensured by dedicated funding, or by identifying other ways that can guarantee the maintenance of the group.
When working on new standards, it is good to first look at existing standards such as ISO/TC 276 or ISA, IEEE, ASTM, ANSI, and release any technical documentation that you have with practical examples so that all community members will be able to understand what needs to be done and contribute effectively. It also helps to create educational materials for diverse stakeholders to make it easier for them to engage with the development of the standard.
The success of grass-roots governance groups depends on their ability to sustain the work in all phases, reward and incentivise all contributors, and deliver a standard that is fit for purpose. Success is thus not primarily a matter of technical development, but also depends on how well you are able to set up and maintain a community that contributes to the standard. After all, a standard is not going to adopt itself!
If you need more information on how you can maintain an (online) community, you can see this blog for some more pointers.
FAIRsharing continues to grow and work with the community to ensure the metadata captured therein is as comprehensive and accurate as possible. To help with this, FAIRsharing is looking for users with specific domain experience to help with the curation of appropriate resources into FAIRsharing. This new initiative, to recruit Community Curators, will roll out over the summer. Please contact them (email@example.com) to find out more!
FAIRsharing is in the process of integrating FAIRsharing with DMPonline. They are also setting up a collection of all the available tools to assess whether digital objects are FAIR on FAIRassist.org. FAIRsharing is also working on standard criteria for recommending data repositories (see below) so that publishers can assess whether they should endorse a certain data repository.
FAIRsharing is currently being redesigned, with a new version being released by the end of 2020, and they are always happy to hear from you (through email, Facebook or Twitter) what is still missing!
Authors: Yasemin Turkyilmaz – van der Velden, Santosh Ilamparuthi, Marta Teperek
On 8th of June, we had an online meeting to discuss “How to champion our OS community?” with 32 participants and a very active discussion. The why, what and how of this meeting can be found below.
Why did this meeting happen?
The community started as Data Champions but has since evolved. Data Champions have brought in other topics such as reproducible computational workflows, open-source software, open access to publications, citizen science, open hardware etc. So, does the “data” in “data champions” properly reflect what the community has become?
The community is also very inclusive, and people who join sometimes do so because they want to learn from others. Is the name “champion” inclusive enough?
In parallel, there was the emergence of Open Science Communities in the Netherlands and most Dutch universities have one… Would we also like to have an Open Science Community, or is our Data Champions community the Open Science Community at Delft?…
What happened during this meeting?
After a brief introduction, we heard about the Open Science Communities in the Netherlands from Loek Brinkman, co-founder of the Open Science Community Utrecht, which is the first Open Science Community in the Netherlands. Then Marta Teperek explained the pros and cons of the possible ways forward, which are:
Stay as “Data Champions”
Rebrand to ‘Open Science Community Delft’
Join an umbrella ‘Open Science Community Delft’*
* – if it comes to exist in the future
Then we split into smaller groups to discuss the pros and cons of each option. This was followed by each group reporting the outcomes of the group discussions.
What are the outcomes of this meeting?
The reporting of group discussions was followed by voting. 29 people participated and here are the results:
Stay “Data Champions”: 0 votes
Rebrand to ‘Open Science Community Delft’: 16 votes
Join an umbrella ‘Open Science Community Delft’: 13 votes
What are the next steps?
The outcomes of the discussions and voting results suggested that all participants agree that the TU Delft Data Champions community should be rebranded as Open Science Community Delft.
There was some confusion about the option “Join an umbrella ‘Open Science Community Delft’”, as no ‘Open Science Community Delft’ exists yet, and about how this option would differ from “Rebrand to ‘Open Science Community Delft’”, since members of the Open Science Community Delft can in any case start up member initiatives focused on a specific practice or discipline. We already have examples of this with the TU Delft Data Champions Community:
TPM Data Champion Anneke Zuiderwijk has initiated and is regularly organizing Open Data Meetings.
Therefore Open Science Community Delft can be an umbrella community with member initiatives focused on a specific practice or discipline. Anyone is welcome to start such a subgroup and we are happy to support those interested in doing this.
All these outcomes, together with the meeting notes and recording were shared with the community. The community was given the opportunity to share their doubts, questions or feedback until 22 June. As there were no objections, on 23 June the final decision of rebranding the Data Champions community to Open Science Community Delft was shared with the community.
There will be a branding effort involved in this, which needs to be discussed. The community will be kept abreast of the next steps.
The recent contract signed between the Dutch research institutions and the publisher Elsevier mentions the possibility of an Open Knowledge Base (OKB), but the details are vague. This blog post looks in more detail at definitions of an OKB within the context of scholarly communications, and at the elements that need to be taken into account in building one.
Authors: Alastair Dunning, Maurice Vanderfeesten, Sarah de Rijcke, Magchiel Bijsterbosch, Darco Jansen (all members of above taskforce)
Definition of an Open Knowledge Base
An Open Knowledge Base is a relatively new term, and liable to multiple interpretations. For clarification, we have listed some of the common features of an Open Knowledge Base (OKB):
it hosts collections of metadata (descriptive data) as opposed to large collections of data (spreadsheets, images etc)
the metadata is structured according to triples of subject, predicate and object (e.g. The Milkmaid (subject) is painted by (predicate) Vermeer (object))
each point of the triple is usually related to an identifier elsewhere, for example Vermeer in the OKB could be linked with reference to Vermeer in the Getty Art and Architecture thesaurus
The highly structured nature of the metadata makes it easier for other computers to incorporate that data; OKBs have an important role to play for search engines such as Google as well as a basis for far-reaching analysis
All the data (whether source or derived) is open for others to access and reuse, whether via an API, SPARQL endpoint, a data dump, or a simple interface, typically via a CC0 licence
The data is described according to existing standards, identifiers, ontologies and thesauri
the rules for who can upload and edit the data will vary between OKBs. All OKBs need to deal with a tension between data extent, richness and quality
The technical infrastructure is usually hosted in one place – however, the OKB will link to other OKBs to make a larger network of open metadata. In essence, this creates a federated infrastructure
In some, but not all, cases, the OKB is not an end in itself but supplies the data that other services can build upon; thus there is a deliberate split between the underlying data and the services and tools that use that data
An OKB shares some aspects with a Knowledge Base of Metadata on Scholarly Communication, but is broader both in terms of content and in its commitment to openness
The best current example of an Open Knowledge Base is Wikidata. An example of a service built on top of Wikidata is Histropedia. Library communities around the globe also contribute journal titles to the Global Open Knowledgebase (GOKB).
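The triple model described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real triple store; all URIs are hypothetical placeholders, and in practice the object would point to an external identifier such as one in a Getty vocabulary:

```python
# A triple is (subject, predicate, object). Each point is a URI, so it can
# link out to identifiers maintained elsewhere (all URIs here are made up).
MILKMAID = "http://example.org/artwork/milkmaid"
PAINTED_BY = "http://example.org/prop/paintedBy"
VERMEER = "http://example.org/person/vermeer"  # in practice, an external identifier

triples = {(MILKMAID, PAINTED_BY, VERMEER)}

def query(triples, s=None, p=None, o=None):
    """Pattern matching over triples: None acts like a SPARQL variable."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

# Who painted the Milkmaid?
painters = [o for (_, _, o) in query(triples, s=MILKMAID, p=PAINTED_BY)]
```

In a real OKB this pattern matching is what a SPARQL endpoint provides over an RDF store; the point here is only that highly structured metadata can be queried by machines without a bespoke interface.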
Open Knowledge Bases and Scholarly Communication
Traditionally, metadata related to scholarly communications has been managed in discrete, unconnected, closed, commercial systems. Such collections of data have been closely tied to the interface to query the data. This restricts the power of the data – whoever creates the interface determines what types of questions can be asked.
An Open Knowledge Base counters this. Firstly, it separates the interface from the data. Secondly, it opens up and connects the underlying metadata to other sources of metadata. Such an approach allows much greater freedom – users are no longer restricted by the specific manner in which the interface was designed nor restricted to querying one set of metadata. Such openness makes the OKB flexible about the type of data it incorporates and when – other data providers with different datasets can connect or incorporate their data at a date that suits them. The openness also allows third parties to build specific interfaces and different services on top of the OKB.
For the field of scholarly communication, an ambitious federated metadata infrastructure would connect all sorts of entities, each with clear identifiers. Researchers, articles, books, datasets, research projects, research grants, organisations, organisational units, citations etc could all form part of a national OKB that connects to other OKBs. It would also help create enriched data, which could then be fed back into the OKB.
Such a richness of metadata would be a springboard for an array of services and tools to provide new analyses and insights on the evolution of scholarly communication in the Netherlands.
The best current example of an Open Knowledge Base for scholarly communication is the one developed by OpenAIRE.
Issues in constructing an Open Knowledge Base for the Netherlands (OKB-NL)
A well constructed open knowledge base can play a significant role in innovation and efficiency in the scholarly communications ecosystem. Given the breadth of data it can contain, it could be the engine for sophisticated research analysis tools. But it requires significant long-term engagement from multiple stakeholders, who will be both providing and consuming data. It is imperative that such stakeholders work in a collaborative fashion, according to an agreed set of principles.
Whatever principles are used to underlie an OKB, there also needs to be serious thought given to practical concerns. How would an OKB be created and sustained? An OKB is an ambitious project; if it is to succeed it requires strong foundations. The following issues would all need to be addressed:
Who would steer the direction of the OKB? How would any board reflect the multiple research institutions contributing to the OKB? To make an OKB effective, it would require the ongoing participation of every research institution in NL – how would the business model ensure that? And who would actually do the day-to-day management of the OKB? What should be the role of commercial organisations contributing to the OKB and its underlying principles? Should they have a stake in the governance of an OKB?
Who would pay the initial costs for establishing an OKB? How would the ongoing cost be paid? Via institutional membership? Via consortium costs? Via government subsidy? Via public-private partnerships? Would all institutions gain equal benefit from the OKB? Would they pay different rates?
What kind of technical architecture does the OKB require – centralised, with all the data in one place, or distributed, with data residing in multiple locations? If the latter, how can we ensure that the data is open and interoperable? Or some kind of clever hybrid? Given its role as the foundation of other services, how can it be guaranteed that the OKB has uptime as close to 100% as possible? And how can it be as responsive as possible, providing instantaneous responses to user demand?
Scope of Metadata Collection
The potential scope of an OKB is huge. Each content type has its own specific metadata schemes, and these schemes evolve over time. How are different metadata types incorporated over time? Article metadata first? Then datasets, code, funding grants, projects, organisations, authors, journals? What about different versions of metadata schemes – do all backlog records need to be converted?
Quality, Provenance and Trust
Would the metadata in the OKB be sufficient to underpin high-quality services? What schemas would need to be created for the different sorts of metadata? What critical mass of metadata would be required to create engaging services? What kind of metadata alignment and enrichment would need to be undertaken? Would that be done centrally, or by institutions and publishers? What costs would be associated with that? Would the costs be ongoing? Should provenance be attributed to the original suppliers of the metadata and of metadata enrichments?
Service development and Commercial engagement
What incentives would there be for commercial partners to a) provide metadata and b) build services on top of the OKB? Would the investment to develop such services simply lead to one or two big companies dominating the service offer? Would they compete with services not relying on the OKB? What would happen to enriched data created by commercial companies? Would it be returned to the OKB?
Would the resulting services be of use to all contributing members? Could the members develop their own services independent of commercial offerings?
Implementation timeline: Lean or Big Bang
When implementing the OKB, should we first carefully design the full stack of the infrastructure and solve all the questions within the grand information architecture? Or let it grow organically, and start by collecting the metadata in the formats that are already legally available according to the publishing contracts? Can we do both in parallel: start collecting, and start designing?
To give a hint of how the OKB could be realised, we need to introduce two other concepts. One is the star rating system for linked open data; the other is building the OKB in two different phases.
Linked Open Data Star Rating
This is a concept introduced by Sir Tim Berners-Lee (https://5stardata.info/): rather than an internet of web pages that can be read by humans, it envisions data on the web that can be read and interpreted by machines, directly and interoperably, using a unified agreed standard: the Resource Description Framework, or RDF. Putting your data on the web in RDF earns you five stars. The vision of the OKB is to have all the metadata available as 5-star linked open data. This, however, is not the current reality. The data supplied by publishers and universities and put in the OKB is 3-star data at best: 1. made available on the web (e.g. in a data repository); 2. in a structured manner (e.g. as a table or nested structure); 3. in a non-proprietary format (e.g. CSV, JSON, XML).
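The gap between 3-star and 5-star data can be made concrete with a small sketch: a 3-star CSV record is structured and non-proprietary, but its values are plain strings; lifting it towards linked data means turning those values into URIs that other parties can link to. The data and URIs below are hypothetical:

```python
import csv
import io

# A 3-star record: structured, non-proprietary (CSV), but not linked -
# "art-42" and "grant-7" are just strings. (Hypothetical data.)
three_star = "article,grant\nart-42,grant-7\n"

# Lifting towards 4/5 stars: mint a URI for each value so the record can be
# linked to other data sources (URIs below are made-up placeholders).
triples = [
    (f"http://example.org/article/{row['article']}",
     "http://example.org/prop/fundedBy",
     f"http://example.org/grant/{row['grant']}")
    for row in csv.DictReader(io.StringIO(three_star))
]
```

The remaining two stars come from using URIs (step 4) and linking them to identifiers others maintain (step 5), which is exactly what the envisioned OKB would do at scale.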
This brings us to the next concept.
IST (as-is) and SOLL (to-be): the OKB in different phases and at different speeds
We start right away, building an OKB with what we have right now: mature technology and robust services in Phase 1, while starting to build our envisioned OKB in Phase 2.
The following dives into the details and makes things much more concrete, to make it tangible what a Phase 1 OKB can actually be.
IST: Start small and lean – What can we do in the next couple of years?
To make an initial start that is feasible, and to work on pilots, we need to work with the data and the data formats that existing systems can already deliver.
What follows is our train of thought on what the Phase 1 OKB should look like, but we would love to hear yours in the comments below.
OKB; data repository for 3-star data
In this initial phase we appoint a data repository as the initial location for metadata providers to periodically deliver their metadata files under a CC0 license, including information on the standard of the files delivered (how to interpret the syntax and semantics). This could be, for example, the 4TU.Datacentre or Dataverse.nl, where OKB deposits can be made into a separate collection/dataverse.
Services; working with 3-star data
The datasets are available to the service providers, who download the files and process them into their own data structures. Here, at the services level, the links between the different information entities come into existence and can be used for the purpose of the service.
Metadata Providers; delivering 3-star data
In our case we have different kinds of metadata providers, to name a few: third parties, universities and funders. The third parties can be publishers, indexes or altmetrics providers. Each of these can deliver different information entities in the scholarly workflow, and can deliver files in different formats, in an open standard, with a CC0 license.
These deliveries contain information about Organisational Units, Researchers, Projects, Publications, Datasets, Awarded Grants, etc.
All information entities need to be delivered as individual files in a zipped package. That package must be logically aggregated and deposited, e.g. by year or month. Provenance metadata about the source providing the data, and an open licence, need to be added. Deposits also need descriptive metadata, including pointers to the open standard of the data files, to adhere to the FAIR principles (https://www.go-fair.org/fair-principles/).
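Such a packaged deposit can be sketched with the standard library alone. This is a minimal, hypothetical example of the structure (the file names, provider, and schema URL are assumptions, not a prescribed format):

```python
import io
import json
import zipfile

# Hypothetical deposit package: one file per information entity, plus
# provenance metadata, licence, and a pointer to the open standard used.
files = {
    "publications.json": json.dumps([{"title": "Example article"}]),
    "datasets.json": json.dumps([]),
    "provenance.json": json.dumps({
        "provider": "Example University",   # who supplied the data
        "period": "2020-01",                # logical aggregation, e.g. by month
        "license": "CC0-1.0",               # open licence for reuse
        "schema": "http://example.org/standards/okb-deposit-v1",
    }),
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name, content in files.items():
        zf.writestr(name, content)

package = buf.getvalue()  # bytes ready to deposit in the repository
```

A service provider receiving this package can read the provenance file first to learn the licence and the standard before processing the entity files.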
Service providers can then download the data from the OKB and, for example, fill a search index with that information. This can then be used, for instance, to enrich the metadata of the Dutch CRIS systems.
SOLL: the open knowledge base of the future
To stay true to the 5-star Linked Open Data mindset, this OKB is an interconnected, distributed network of data entities, where access and connectivity are maintained by each owner of the data nodes. Those node owners can be publishers, funders or universities. They can independently make claims, or assertions, about the identifiers of the content types they maintain.
For example, publishers maintain identifiers for publications, universities for affiliations of researchers, ORCID for researchers, funders for funded projects, etc. This interconnectivity is gained by the fact that, firstly, node owners can make claims about their content types in relation to other content types of other node owners. For example, a publisher can make the assertion that this publication was made with funds from that funded project, independently of the funder itself.
This stays true to the early days of the internet, where everyone could make their own web page and link to others without bi-directional approval. Secondly, assertions are made using entities and relationships defined by the Linked Open Data cloud. This assures interoperability in a way that machines understand on a semantic level, so they know which concepts they can use in their internal processes. For example, a data field called "name" could be interpreted by one machine as the first name of a person, while another machine interprets it as an organisation name or the name of an instrument. Using the ontologies of the Linked Open Data cloud, the exact semantic meaning can be pinned to the field "name".
To keep track of who made which assertion, provenance information is added. This way, services are able to weigh assertions from one node owner differently from another. (More about that in Nanopublications: www.nanopub.org.)
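The idea of weighing assertions by their provenance can be illustrated with a small sketch. All names and trust weights here are hypothetical; a real implementation would use nanopublication named graphs rather than Python dicts:

```python
# Nanopublication-style sketch: each assertion carries provenance (who made
# it), so a service can weigh conflicting claims differently per source.
assertions = [
    {"triple": ("article-42", "fundedBy", "grant-7"), "asserted_by": "publisher-x"},
    {"triple": ("article-42", "fundedBy", "grant-9"), "asserted_by": "funder-y"},
]

# A service's own (hypothetical) trust weights per node owner; here the
# funder is trusted more than the publisher on funding claims.
trust = {"publisher-x": 0.4, "funder-y": 0.9}

def rank(assertions, trust):
    """Order assertions by how much this service trusts the asserting source."""
    return sorted(assertions,
                  key=lambda a: trust.get(a["asserted_by"], 0.0),
                  reverse=True)

best = rank(assertions, trust)[0]
```

Different services can apply different weights to the same assertions, which is exactly why the provenance must travel with the claim rather than being decided centrally.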
Zooming out, we see the OKB, connected with the linked data cloud, as a “knowledge representation graph” that has numerous applications in answering the most complex research questions.