
How to make research software a first-class citizen in the Netherlands?

Image credit: Clive Warneford / CC BY-SA

This blog post was originally published on the NL-RSE website and is re-posted here.


TL;DR – Read our recommendations to make research software a first-class citizen.

Antoni van Leeuwenhoek is considered the father of microbiology. His discovery of microbes, using lenses he made himself, created an entirely new field of research. He was at the same time a researcher and a tool maker: his research would not have been possible without the tools he built. Leeuwenhoek was well known both for his discoveries in microbiology and for the unmatched quality of his lenses.

Four centuries after Leeuwenhoek, research tools have only gained importance. Recently, a new type of tool has become critically important: research software. The 2020 COVID-19 pandemic has brought the importance of research software to the public eye.

However, research software does not receive the recognition it deserves. A group of members of the NL-RSE network, together with software-minded data specialists, got together in an attempt to raise the profile of research software. Our position paper provides further details.

Back in March 2019, we had a meeting with NWO about the role of software in research. Following that meeting, we wrote a position paper with recommendations for funding agencies and research institutions to raise the profile of research software. In August 2019, we made it publicly available for comments from the RSE community. In November 2019, we also held a feedback session during the NL-RSE conference. The author group got together again in January 2020 to integrate the community feedback. After a long revision process, the “final” version is ready. This paper focuses on the Netherlands, but the issues and recommendations could be adapted and adopted by other countries.

These recommendations have already been broadly commented on; however, if you would like to comment on them, feel free to reach out to any of the authors or contact us via the NL-RSE network.

Why figshare? Choosing a new technical infrastructure for 4TU.ResearchData

Written by Marta Teperek & Alastair Dunning

4TU.ResearchData is an international repository for research data in science, engineering and design. After more than 10 years of using Fedora, an open source repository system, to run 4TU.ResearchData, we have decided to migrate a significant part of our technical infrastructure to a commercial solution offered by figshare. Why did we decide to do this? And why now, at a time of increasing concern about relying on proprietary solutions, particularly those associated with large publishing houses, to run scholarly communication infrastructures? (see, for example, In pursuit of open science, open access is not enough and the SPARC Landscape Analysis)

We anticipate that members of our community, as well as colleagues who use or manage scholarly communications infrastructures, might be wondering the same. We are therefore explaining our thinking in this blog post, hoping it will facilitate more discussion about such developments in the scholarly communications infrastructure.

Why not continue with Fedora?

So, first, why not continue with Fedora? Any software, but open source software in particular, needs to be maintained. It’s a tough process. Maintenance means developers, who are often difficult to retain within academic environments (industry offers more competitive salaries); other developers, approaching retirement, are hard to replace. We also faced the challenge of migrating to the next version of Fedora – a significant undertaking simply to keep the repository running.

At the same time, researchers started requesting additional functionality: better statistics, restricted access to confidential datasets, integration with GitHub, among many others. With insufficient development capacity it proved increasingly challenging to keep up with these demands. Undertaking a public tender for a managed repository platform, where development efforts could be outsourced to the entity providing the repository solution, looked like the best way to deal with these twin challenges.

Why not Zenodo or Dryad? Or another open source repository?

Open Source advocates may ask why we did not go with an open source repository solution. We tried hard. We were in discussion with Zenodo (who are working on the Invenio out-of-the-box repository solution), but the product was still at the pilot stage when we had to start our tender. We had discussions with Dryad, but Dryad’s offering at the time did not give us the functionality we required. Another university running an open source repository platform contacted us, but they withdrew in the end – the tender process required too much bureaucracy.

We received no interest from other open source repository tools providers, despite utilising several data management and repository networks to share the information about the tender to solicit broader participation.

Tender

The next step was to start the public tender process. Within the EU, this is a compulsory step for transparency and accountability purposes at any public institution making purchases over a certain threshold. The tender process is an exhausting hurdle. But it does offer the opportunity to describe exactly the services and guarantees which are essential. This is very useful for building some security against vendor lock-in.

We had already made the decision to retain servers at TU Delft for data storage. Additional requirements within the tender included the guarantee that all metadata would be CC0; that import and export formats (e.g. JSON, XML) and protocols (a good API) would be available and well documented; and that an escape strategy would be supplied by the winning bidder, demonstrating the measures that would be enacted for DOIs, data, user information, metadata and webpages should either party wish to leave the contract. The winning bidder also offered to make its code open source should it ever cease development. Such arrangements provide some flexibility for us; if conditions change in the future, we are in a position to change our technical infrastructure.

Why figshare?

There were two main bidders in the tender process; both coming from the commercial sector. Figshare turned out to have the better and more mature solution, and was the winner of the tender process.

Collaboration with figshare

We have now started working together with figshare and are just about to complete the migration process from the old Fedora repository to figshare.

What we have already noticed is the professionalism and responsiveness of figshare colleagues. They have a team of developers devoted to the product. We are pleased with figshare’s integration capability – data in and out via APIs, which enables opportunities to connect our repository with other tools and products used by our research community.
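
To make the “data in and out via APIs” point concrete, here is a minimal sketch of pulling records out of the public figshare API (v2). The endpoint and field names follow the public API documentation as we understand it; treat this as an illustration rather than the exact integration we run, and check docs.figshare.com before relying on it.

```python
import requests

# Public figshare API, version 2 (no authentication needed for public records).
BASE = "https://api.figshare.com/v2"

# "Data out": list a page of publicly available items.
resp = requests.get(f"{BASE}/articles", params={"page": 1, "page_size": 10}, timeout=30)
resp.raise_for_status()

for item in resp.json():
    # Each item is a plain JSON record; among other fields it carries a DOI and a title.
    print(item.get("doi"), "-", item.get("title"))
```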

We are also pleased to see that figshare are interested in receiving and considering user feedback. As a result of such feedback, they are now rethinking the whole metadata structure offered by the platform and considering potential future support for RDF-format datasets. Such a move could enable greater interoperability of data.

But. Figshare is not an open source tool and it is one of the products offered by a technology company called Digital Science. Digital Science is part of Holtzbrinck Publishing group, which also owns the publishing giant Springer Nature. As mentioned before, there are concerns within the community about publishers getting a strong grip on scholarly communication infrastructure and data infrastructure in particular.

Future

Short-term (our contract with figshare is for three years), figshare promises to deliver functionality for which our end user communities have been waiting a long time. We are pleased to be working with them.

But long-term we are still interested in the alternatives. There are a number of initiatives, for example, Invest in Open Infrastructure, which aim to develop collaborative, viable and sustainable open source infrastructures for scholarly communications. These are crucial strategic investments for research institutions and time, money and expertise should be devoted to such activities.

The broader research community is still in need of open source alternatives, which can be developed and sustained in a collaborative manner.

Capacity needed

However, someone needs to develop, organise and sustain long-term maintenance of such open source alternatives. Who will that be? There are many organisations facing this challenge. 

So we should really invest in our capacity to collaborate on open source projects. Only then will we be able to co-develop much-needed open source alternatives to proprietary products.

Short-term savings and long-term strategic plans are different matters and both require careful planning.

Lessons learnt

Finally, we also wanted to share some lessons learnt. 

More transparency and more evidence needed

Our first lesson learnt is that more transparency and more evidence comparing the costs of running open source versus commercial infrastructures are needed. Many say that commercial, managed infrastructures are cheaper. However, implementation of such infrastructures does not happen at no cost. The efforts involved in migration, customisation, communication etc. are not negligible, and they apply to both open source software and commercial platforms. One recent publication suggests that the effort needed to sustain one’s own open source infrastructure is comparable to that involved in implementing a third-party solution in an institutional setting.

We need more evidence-based comparisons of running such infrastructures in scholarly communications settings. 

Easy to criticise. Easy to demand. But we need working, sustainable solutions.

Finally, we have received some criticism over our decision to migrate to figshare, in particular from Open Science advocates. 

While we acutely appreciate, understand and wholeheartedly support the strategic preference for Open Source infrastructures at academic institutions and in information management in particular, viable alternatives to commercial products are not always available in the short term.

We need to talk more and share much needed evidence and experience. We also need to invest in a skilled workforce and join forces to work together on developing viable solutions for open source infrastructures for scholarly communications, which hopefully will be coordinated by umbrella organisations such as Invest in Open Infrastructure.

Running tender processes makes different workforce demands

While outsourcing solves the problem of lack of developers, running an EU tender process creates other challenges. Tender processes are slow, cumbersome and require dedicated legal and procurement support. Discussions are no longer with in-house developers but with legal advisers. The procurement process requires numerous long documents, a forensic eye for detail, and an ability to explain and justify even the simplest functional demands. To ensure an equal and fair process, everything needs to be quantified. For example, one cannot simply require that an interface shows ‘good usability’ – the tender documents need to define good usability and indicate how it will be judged in the marking process. 

If others are undertaking the same process, they may wish to consult the published version of the tender document.

We hope that the published tender document, as well as this blog post, might initiate greater discussion within the community about infrastructures for scholarly communication and encourage more sharing of evidence and experience.

Clarification

Added on 20 September 2020

We are grateful for all the comments and reactions we have received on our recent blog post “Why figshare? Choosing a new technical infrastructure for 4TU.ResearchData”.

Our intention behind the original post was to explain the processes behind our decision, as honestly as we possibly could. However, some of the comments we received made us realise that we unfairly portrayed our colleagues from the Invenio and Dryad teams, as well as other colleagues supporting open source infrastructures. This is explained in the blog post “Sustainable, Open Source Alternatives Exist”, which was published as a reaction to our post. We apologise for this.

We did not mean to imply in our post that sustainable open source alternatives do not exist. That is not what we think or believe. We also did not mean to imply that open source and hosted are mutually exclusive. 

We wholeheartedly agree with the remark that tenders are bureaucratic hurdles. However, tender processes are often favoured by big public institutions. The fact that open source infrastructure providers are often unable to compete successfully in such processes is an issue.

In the future, we would like to be involved in discussions about making tender processes accessible and fair to open source providers, or how to make alternatives to tender processes acceptable at large public institutions.

Remote ReproHacking

Author: Esther Plomp

The first Remote ReproHack was held on the 14th of May 2020. About 30 participants joined the online party with the mission to learn more about reproducibility and to reproduce some papers! A ReproHack is a one-day event where participants aim to reproduce papers of their choice, from a list of proposed papers whose authors have indicated that they would like to receive feedback. The ReproHack aims to provide a safe space for constructive feedback, so that it is a valuable learning experience for the participants and the authors.

Recent studies and surveys have indicated that scientific papers often cannot be reproduced because the supporting data and code are inaccessible or incorrect (see for example the Nature survey results here). In computational research, only 26% of papers are reproducible (Stodden 2018). To learn more about how these numbers can be improved, I joined the first ReproHack in the Netherlands last year. During this ReproHack I managed to reproduce the figures from a physics paper on Majorana bound states by André Melo and colleagues. I must admit that most of the work was done by Sander, who was very patient with my beginner Python skills. This year, I was set on trying to reproduce a paper that made use of R, a language that I have learned to appreciate more since attending the Repro2020 course earlier this year.

The Remote ReproHack started with welcoming the participants through signing in on an online text document (HackMD) where we could list our names, affiliations and Twitter/GitHub information. This way we could learn more about the other participants. The check-in document also provided us with the schedule of the day, the list of research papers from which we could choose a paper to reproduce, and the excellent code of conduct. After this digital check-in and words of welcome, Daniel Nüst gave a talk about his work on improving the reproducibility of software and code. Next, Anna Krystalli, one of the organisers, took us through the process of how to reproduce and review the papers during the ReproHacking breakout sessions. During these breakout sessions the participants were ‘split up’ into smaller groups to work on the papers that they had selected to reproduce. It was also possible to try to reproduce a paper by yourself.

Slide on CODECHECK from the presentation by Daniel Nüst

10:00 – Welcome and Intro to Blackboard Collaborate
10:10 – Ice breaker session in groups
10:20 – TALK: Daniel Nüst – Research compendia enable code review during peer review (slides)
10:40 – TALK: Anna Krystalli – Tips and Tricks for Reproducing and Reviewing (slides)
11:00 – Select Papers
11:15 – Round I of ReproHacking (break-out rooms)
12:15 – Re-group and sharing of experiences
12:30 – LUNCH
13:30 – TALK: Daniel Piqué – How I discovered a missing data point in a paper with 8000+ citations
13:45 – Round II of ReproHacking (break-out rooms)
14:45 – COFFEE
15:00 – Round III of ReproHacking (break-out rooms) – Complete Feedback form
16:00 – Re-group and sharing of experiences
16:30 – TALK: Sarah Gibson – Sharing Reproducible Computational Environments with Binder (slides) (see also here for materials from a Binder Workshop)
16:45 – Feedback and Closing

The participants had ~15 minutes to decide which paper they would like to reproduce from a list of almost 50 papers! The group that I joined set out to reproduce the preprint by Eiko Fried et al. on mental health and social contact during the COVID-19 pandemic. Our group consisted of Linda Nab, one of the organisers of the ReproHack, Alessandro Gasparini (check out his work on INTEREST here if you work with simulations!), Anna Lohmann, Ciu and myself. The first session was spent finding out how we could download all the data and code from the Open Science Framework. After we had retrieved all the files, we had to download packages (or update R). During the second session we were able to do more reproducing rather than just getting set up. The work by Eiko Fried was well structured and documented, so after the initial setup problems, the process of reproducing the work went quite smoothly. In the end, we managed to reproduce the majority of the paper!
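
For those who have not used it: the Open Science Framework exposes projects through a public REST API, so fetching a paper’s data and code can be scripted rather than clicked through. Below is a minimal sketch that lists the files of a (hypothetical) OSF project; the endpoint shape follows the documented OSF v2 API, but check the API docs before relying on the exact field names.

```python
import requests

# Hypothetical five-character OSF project id; replace with the project you want to inspect.
NODE = "abc12"

# List the contents of the project's default "osfstorage" file provider.
url = f"https://api.osf.io/v2/nodes/{NODE}/files/osfstorage/"

while url:
    page = requests.get(url, timeout=30).json()
    for entry in page["data"]:
        attrs = entry["attributes"]
        # "kind" distinguishes files from folders in the OSF JSON-API response.
        print(attrs["kind"], attrs["name"])
    # Follow JSON-API pagination until there are no more pages.
    url = page["links"].get("next")
```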

Tweet by Eiko Fried on his experiences on submitting a paper for feedback to the Remote ReproHack.

In the third session, feedback was provided to the authors of the papers that were being reproduced, using the feedback form that the ReproHack team had set up. This form contained questions about which paper was chosen, whether the participants were able to reproduce the paper, and how much of the paper was reproduced. In more detail, we could describe which procedures/tools/operating systems/software we used to reproduce the paper and how familiar we were with these. We also had to rate the reusability of the material, and indicate whether the material had a licence. A very important section of the feedback form asked which challenges we ran into while trying to reproduce the paper, and what the positive features were. A separate section was dedicated to the documentation of the data and code, asking how well the material was documented. Additional suggestions and comments to improve the reproducibility were also welcomed.

After everyone returned from the last breakout sessions and filled in their feedback forms, the groups took turns to discuss whether they were able to reproduce the papers that they had chosen and, if not, which challenges they faced. Most of the selected papers were reproduced by the participants. It was noted that proper documentation, such as readme files, manuals and comments in the scripts themselves explaining the correct operating instructions to users, was especially helpful in reproducing someone else’s work.

Another way of improving the quality and reproducibility of research is to ask your colleagues to reproduce your findings and offer them a co-author position (see this paper by Reimer et al. (2019) for more details on the ‘co-pilot system’). Some universities have dedicated services for checking the code and data before they are published (see this service at Cornell University).

There are also several tools available to check and clean your data.

If you would like to learn more about ReproHacks, the Dutch ReproHack team wrote a paper on the Dutch ReproHack held in November 2019. If you would like to participate, organise your own ReproHack, or contribute to the ReproHack work, the ReproHack team invites contributions on GitHub.

Anna Krystalli provided the Remote ReproHack participants with some additional resources to improve the reproducibility of our own papers.

FAIRsharing: how to contribute to standards?

Contributors, in chronological order of contribution: Esther Plomp, Paula Martinez Lavanchy, Marta Teperek, Santosh Ilamparuthi, and Yasemin Turkyilmaz – van der Velden.

FAIRsharing organised a workshop for the Data Stewards and Champions at TU Delft on the afternoons of the 11th and 12th of June. We were joined by colleagues from University of Stuttgart, RWTH Aachen University, Technical University of Denmark (DTU), and the Swiss Federal Institute of Technology Lausanne (EPFL).

FAIRsharing is a cross-disciplinary platform that houses manually curated metadata on standards, databases and data policies. FAIRsharing works together with a large community that can add their metadata standards, policies and databases to the platform. You can view the introduction presentation here (see here for the slides).

During the first day of the workshop, which was led by Peter McQuilton, there was a demonstration of how to search FAIRsharing and how to apply the standards therein. The curation activities around the standards and databases in FAIRsharing were also explained in detail. On the second day, the participants discussed how to develop standards when there are no community-endorsed standards available, and how to contribute a standard to FAIRsharing. You can view a recording of the second day here (slides available here).

Day 1: FAIR and FAIRsharing

For anyone who has never heard of the FAIR principles (Findable, Accessible, Interoperable and Reusable), a short explanation is outlined below:

Findable

  • For your information/data to be findable, it needs to be discoverable on the web
  • It needs to be accompanied by a unique persistent identifier (e.g., DOI)

Accessible

  • For your information/data to be accessible, it needs to be clearly defined how this would be possible, and appropriate security protocols need to be in place (especially important for sensitive data which contains personal information)

Interoperable

  • For your information/data to be interoperable, it needs to be machine-actionable; it needs to be structured in a way that not only humans but also software/machines can interact with (see the sketch after this list)
  • Your data can be more easily integrated with the data of other researchers when you use community-adopted standards (formats and guidelines, such as those for reports or publications)
  • You should link your information/data to other relevant resources

Reusable

  • For your information/data to be reusable, it needs to be clearly licensed, well documented and the provenance needs to be clear (for example, found in a community repository)
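
As a small, concrete illustration of machine-actionability: a DOI does not only resolve to a human-readable landing page; a program can request structured metadata for the same identifier via content negotiation on doi.org. The sketch below uses that standard mechanism; the DOI itself is a hypothetical placeholder.

```python
import requests

# Hypothetical example DOI; substitute the DOI of any registered dataset or article.
doi = "10.5281/zenodo.1234567"

# Asking doi.org for CSL JSON instead of HTML returns metadata a machine can parse.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()

meta = resp.json()
print(meta.get("title"), "-", meta.get("publisher"))
```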

During our workshop, the FAIRsharing team highlighted that in order to make data truly FAIR, we need to have data standards! FAIRsharing helps people find disciplinary standards and provides support on the application of standards. Delphine Dauga highlighted that it is important for communities to share vocabularies in order to communicate effectively with each other, as well as with machines. You can view the recording of her talk on the curation process of standards on FAIRsharing.org on YouTube.

You can contribute to FAIRsharing by adding standards. During the workshop we were guided by Allyson Lister through this process.

FAIRsharing also allows one to see the relationships between objects, which can be used to see how widely adopted a standard is. See, for example, the graph of the repositories recommended by PLOS.

Day 2: How to contribute to, or develop, a standard?

To start off this day, a definition of “a standard” was given by Susanna Sansone. A standard is an agreed-upon convention for doing ‘something’, established by community consensus or an authority. For example, nuts and bolts currently follow international standards that specify their sizes, but this was not always the case (see below)!

Image from the slides by Susanna Sansone.

When you cannot find an applicable standard and you’re ready to work on a new one, you should set up a governance community for the standard. This means that a group should be established with individuals that have specific roles and tasks to work on the standard. Groups that are developing standards should have a code of conduct to operate successfully (for example, see The Turing Way code of conduct). There are different directions the group can take: one is to work under established or formal organisations, which produce standards that might be adopted by industry (think of standards that govern the specifications of a USB drive); the other is grass-roots groups that form bottom-up communities. There are advantages and limitations to both. The formal organisations already have development processes in place, which may not be flexible but can engender greater trust among end users. The grass-roots groups, while not having an established community to begin with, provide greater flexibility and are often the route taken when developing research-level standards.

Development of a standard requires time and commitment

The standard needs to be tested and open to feedback, possibly multiple times over a long time period. The group needs to generate a web presence and share the different versions of the standard, ideally in a place where people can contribute to these versions (e.g., GitHub). It is desirable to use multiple communication channels to facilitate broad and inclusive contributions. These contributions do not stop when the standard is developed: it will need to be maintained, and new requests for changes and contributions will have to be implemented. To maintain momentum, one should set clear timelines and ensure that there are moments when more intensive discussions can take place. The governance group also needs to be sustainable. Sustainability can be ensured by dedicated funding, or by identifying other ways to guarantee the maintenance of the group.

Community engagement

When working on new standards, it is good to first look at existing standards, such as those from ISO/TC 276, ISA, IEEE, ASTM or ANSI, and to release any technical documentation that you have with practical examples, so that all community members will be able to understand what needs to be done and contribute effectively. It also helps to create educational materials for diverse stakeholders to make it easier for them to engage with the development of the standard.

The success of grass-roots governance groups depends on their ability to sustain the work in all phases, reward and incentivise all contributors, and deliver a standard that is fit for purpose. Success is thus not primarily a matter of technical development: it also depends on how well you are able to set up and maintain a community that contributes to the standard. After all, a standard is not going to adopt itself!

If you need more information on how you can maintain an (online) community, you can see this blog for some more pointers. 

FAIRsharing continues to grow and work with the community to ensure the metadata captured therein is as comprehensive and accurate as possible. To help with this, FAIRsharing is looking for users with specific domain experience to help with the curation of appropriate resources into FAIRsharing. This new initiative, to recruit Community Curators, will roll out over the summer. Please contact them (contact@fairsharing.org) to find out more!

Latest developments

The recommendation filter in the advanced search options on FAIRsharing.org.

FAIRsharing is in the process of integrating with DMPonline. They are also setting up a collection of all the available tools to assess whether digital objects are FAIR on FAIRassist.org. FAIRsharing is also working on standard criteria for recommending data repositories (see below), so that publishers can assess whether they should endorse a certain data repository.

FAIRsharing is currently being redesigned, with a new version being released by the end of 2020, and they are always happy to hear from you (through email, Facebook or Twitter) what is still missing!

What is an Open Knowledge Base anyway?

The recent contract signed between the Dutch research institutions and the publisher Elsevier mentions the possibility of an Open Knowledge Base (OKB), but the details are vague. This blog post looks in more detail at definitions of an OKB within the context of scholarly communications, and at elements that need to be taken into account in building one.

Readers may also be interested in contributing to the consultation that is being run as part of the Dutch Taskforce on Responsible Management of Research Information and Data. The VSNU will also be commissioning a feasibility study on the topic.

Authors: Alastair Dunning, Maurice Vanderfeesten, Sarah de Rijcke, Magchiel Bijsterbosch, Darco Jansen (all members of above taskforce)

Definition of an Open Knowledge Base

An Open Knowledge Base is a relatively new term, and liable to multiple interpretations. For clarification, we have listed some of the common features of an Open Knowledge Base (OKB):

  • it hosts collections of metadata (descriptive data) as opposed to large collections of data (spreadsheets, images, etc.)

  • the metadata is structured according to triples of subject, predicate and object (e.g. The Milkmaid (subject) is painted by (predicate) Vermeer (object)); see the sketch after this list

  • each point of the triple is usually related to an identifier elsewhere, for example Vermeer in the OKB could be linked with reference to Vermeer in the Getty Art and Architecture thesaurus

  • The highly structured nature of the metadata makes it easier for other computers to incorporate that data; OKBs have an important role to play for search engines such as Google as well as a basis for far-reaching analysis

  • All the data (whether source or derived) is open for others to access and reuse, whether via an API, SPARQL endpoint, a data dump, or a simple interface, typically via a CC0 licence

  • The data is described according to existing standards, identifiers, ontologies and thesauri

  • the rules for who can upload and edit the data will vary between OKBs. All OKBs need to deal with a tension between data extent, richness and quality

  • The technical infrastructure is usually hosted in one place – however, the OKB will link to other OKBs to make a larger network of open metadata. In essence, this creates a federated infrastructure 

  • In some, but not all, cases, the OKB is not an end in itself but supplies the data that other services can build upon; thus there is a deliberate split between the underlying data and the services and tools that use that data   

  • An OKB shares some aspects with a Knowledge Base of Metadata on Scholarly Communication, but is broader both in terms of content and in its commitment to openness
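
For readers who have not worked with triples: the snippet below is a minimal sketch of the structure described in the list above, using the Python rdflib library. All IRIs are hypothetical placeholders; in a real OKB each point of the triple would be linked to an existing identifier (Wikidata, the Getty Art and Architecture Thesaurus, etc.).

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Hypothetical OKB namespace used only for this illustration.
EX = Namespace("https://example.org/okb/")

g = Graph()
# One triple: subject (The Milkmaid), predicate (painted by), object (Vermeer).
g.add((EX.TheMilkmaid, EX.paintedBy, EX.Vermeer))
# A human-readable label attached to the same subject.
g.add((EX.TheMilkmaid, RDFS.label, Literal("The Milkmaid", lang="en")))

# Serialise the graph as Turtle, a common exchange format for RDF.
print(g.serialize(format="turtle"))
```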

The best current example of an Open Knowledge Base is Wikidata. An example of a service built on top of Wikidata is Histopedia. Library communities around the globe also contribute journal titles to the Global Open Knowledgebase (GOKB).

Open Knowledge Bases and Scholarly Communication

Traditionally, metadata related to scholarly communications has been managed in discrete, unconnected, closed, commercial systems. Such collections of data have been closely tied to the interface to query the data. This restricts the power of the data – whoever creates the interface determines what types of questions can be asked.

An Open Knowledge Base counters this. Firstly, it separates the interface from the data. Secondly, it opens up and connects the underlying metadata to other sources of metadata. Such an approach allows much greater freedom – users are no longer restricted by the specific manner in which the interface was designed nor restricted to querying one set of metadata. Such openness makes the OKB flexible about the type of data it incorporates and when – other data providers with different datasets can connect or incorporate their data at a date that suits them. The openness also allows third parties to build specific interfaces and different services on top of the OKB.
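
To illustrate the “separates the interface from the data” point: anyone can query an open knowledge base directly, without going through a vendor-built interface. Here is a small sketch that queries the public Wikidata SPARQL endpoint for a handful of scholarly articles; Q13442814 is Wikidata’s class for “scholarly article”.

```python
import requests

# A SPARQL query: five items that are instances of "scholarly article", with English labels.
query = """
SELECT ?article ?title WHERE {
  ?article wdt:P31 wd:Q13442814 .
  ?article rdfs:label ?title .
  FILTER (lang(?title) = "en")
} LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "okb-blog-example/0.1"},  # Wikidata asks clients to identify themselves
    timeout=60,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["article"]["value"], "-", row["title"]["value"])
```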

A representation of the Open Knowledge Base, with an idea of how metadata is provided, enriched and then re-used

For the field of scholarly communication, an ambitious federated metadata infrastructure would connect all sorts of entities, each with clear identifiers. Researchers, articles, books, datasets, research projects, research grants, organisations, organisational units, citations etc could all form part of a national OKB that connects to other OKBs. It would also help create enriched data, which could then be fed back into the OKB.

Such a richness of metadata would be a springboard for an array of services and tools to provide new analyses and insights on the evolution of scholarly communication in the Netherlands.

The best current example of an Open Knowledge Base for scholarly communication is developed by OpenAIRE.

The OpenAIRE Research Graph draws on data from many different scholarly communications tools


TIB Hannover is also developing an Open Research Knowledge Graph. Wikidata also holds plenty of metadata relating to scientific articles. Good examples of enrichment services built on top of Open Knowledge Bases are Scholia, Semantic Scholar and Lens.org. Open Citations provides both a collection of aggregated data (on scholarly citations) and some basic tools to query it. The Global Open Knowledgebase is another example, with a focus on data needed by libraries to undertake collections management. The study by Ludo Waltman looks at further collections of open metadata.

Issues in constructing an Open Knowledge Base for the Netherlands (OKB-NL)

A well-constructed open knowledge base can play a significant role in innovation and efficiency in the scholarly communications ecosystem. Given the breadth of data it can contain, it could be the engine for sophisticated research analysis tools. But it requires significant long-term engagement from multiple stakeholders, who will both provide and consume data. It is imperative that such stakeholders work in a collaborative fashion, according to an agreed set of principles.

The Dutch taskforce on Responsible Management of Research Information and Data has opened a consultation on these principles; readers of this blog are invited to contribute until Monday 8th of June 2020.

Whatever principles are used to underlie an OKB, there also needs to be serious thought given to practical concerns. How would an OKB be created and sustained? An OKB is an ambitious project; if it is to succeed it requires strong foundations. The following issues would all need to be addressed:

Governance

Who would steer the direction of the OKB? How would any board reflect the multiple research institutions contributing to the OKB? To make an OKB effective, it would require the ongoing participation of every research institution in NL – how would the business model ensure that? And who would actually do the day-to-day management of the OKB? What should be the role of commercial organisations contributing to the OKB and its underlying principles? Should they have a stake in the governance of an OKB?

Finance 

Who would pay the initial costs for establishing an OKB? How would the ongoing cost be paid? Via institutional membership? Via consortium costs? Via government subsidy? Via public-private partnerships? Would all institutions gain equal benefit from the OKB? Would they pay different rates?  

Technical

What kind of technical architecture does the OKB require – centralised, with all the data in one place, or distributed, with data residing in multiple locations? If the latter, how can we ensure that the data is open and interoperable? Or some kind of clever hybrid? Given its role as the foundation of other services, how can it be guaranteed that the OKB has as close to 100% uptime as possible? And how can it be as responsive as possible, providing instantaneous responses to user demand?

Scope of Metadata Collection 

The potential scope of an OKB is huge. Each content type has its own specific metadata schemes, and these schemes evolve over time. How are different metadata types incorporated over time? Article metadata first? Then datasets, code, funding grants, projects, organisations, authors, journals? What about different versions of metadata schemes: do all backlog records need to be converted?

Quality, Provenance and Trust

Would the metadata in the OKB be sufficient to underpin high-quality services? What schemas would need to be created for the different sorts of metadata? What critical mass of metadata would be required to create engaging services? What kind of metadata alignment and enrichment would need to be undertaken? Would that be done centrally, or by institutions and publishers? What costs would be associated with that? Would the costs be ongoing? Should provenance be attributed to the original suppliers of the metadata and the metadata enrichments?

Service development and Commercial engagement

What incentives would there be for commercial partners to a) provide metadata and b) build services on top of the OKB? Would the investment to develop such services simply lead to one or two big companies dominating the service offer? Would they compete with services not relying on the OKB? What would happen to enriched data created by commercial companies? Would it be returned to the OKB? 

Would the resulting services be of use to all contributing members? Could the members develop their own services independent of commercial offerings?  

Implementation timeline: Lean or Big Bang

When implementing the OKB, should we first carefully design the full stack of the infrastructure, and solve all the questions within the grand information architecture? Or let it grow organically, and start with collecting the metadata in the formats that are already legally available according to the publishing contracts? Can we do both in parallel: start collecting, and start designing?


As mentioned above, the VSNU will be commissioning a feasibility study of an Open Knowledge Base. In the meantime, Maurice Vanderfeesten has written a further blog on Solutions for constructing an Open Knowledge Base for the Netherlands (OKB-NL)

Solutions for constructing an Open Knowledge Base for the Netherlands (OKB-NL)

This blog post follows on from the earlier blog post on What is an Open Knowledge Base anyway? It is written by Maurice Vanderfeesten (VU, Amsterdam).

To give a hint of how the OKB could be realised, we need to introduce two other concepts. One is the star rating system for data; the other is building the OKB in two different phases.

Linked Open Data Star Rating

This is a concept introduced by Sir Tim Berners-Lee (https://5stardata.info/): the aim is to have the web not only present pages that can be read by humans, but also present data that can be read and interpreted by machines, directly and interoperably, using a unified agreed standard, the Resource Description Framework (RDF). Publishing your data on the web as linked RDF gets you five stars. The vision of the OKB is to have all the metadata available as 5-star linked open data. This, however, is not the current reality. The data provided by publishers and universities and put in the OKB is 3-star data at best: 1. made available on the web (e.g. in a data repository); 2. in a structured manner (e.g. as a table or nested structure); 3. in a non-proprietary format (e.g. CSV, JSON, XML).
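
To make the star ratings tangible, here is a small sketch (with entirely hypothetical names and identifiers) of upgrading a 3-star record towards 5-star linked data: the same content from a structured, non-proprietary CSV file is re-expressed as RDF, with an IRI minted for the record and a link out to a global identifier.

```python
import csv
import io

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, OWL

# A 3-star record: structured, non-proprietary CSV. The DOI is a hypothetical example.
csv_data = io.StringIO("doi,title\n10.1234/abc,An example article\n")

EX = Namespace("https://example.org/okb/")  # hypothetical OKB namespace
g = Graph()

for row in csv.DictReader(csv_data):
    # Mint an IRI for the record (4th star: things have URIs).
    article = EX[row["doi"].replace("/", "_")]
    g.add((article, DCTERMS.title, Literal(row["title"])))
    # Link to the global DOI identifier (5th star: link your data to other data).
    g.add((article, OWL.sameAs, URIRef(f"https://doi.org/{row['doi']}")))

print(g.serialize(format="turtle"))
```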

This brings us to the next concept.

IST and SOLL (German for ‘as-is’ and ‘to-be’): the OKB in different phases and at different speeds

We start right away in Phase 1, building an OKB with what we have right now: mature technology and robust services. And we start building our envisioned OKB in Phase 2.

The following dives into the details and makes things much more concrete, to give a tangible sense of what a Phase 1 OKB can actually be.

IST: Start small and lean – what can we do in the next couple of years?

To make an initial start that is more feasible and to work on pilots, we need to work with the data and the data formats that systems can already deliver.

Our train of thought on what the Phase 1 OKB should look like follows below, but we would love to hear yours in the comments.

OKB: a data repository for 3-star data

In this initial phase we appoint a data repository as the initial location for metadata providers to periodically deliver their metadata files under a CC0 license, including information on the standard of the files delivered (how to interpret the syntax and semantics). This can be, for example, the 4TU.Datacentre or Dataverse.nl, where OKB deposits can be made into a separate collection/dataverse.

Services: working with 3-star data

The datasets are available to the service providers. They need to download the files and process them into their own data structures. Here, at the services level, the interlinks between the different information entities come into existence, and can be used for the purpose of the service.

Metadata providers: delivering 3-star data

In our case we have different kinds of metadata providers. To name a few: third parties, universities and funders. The third parties can be publishers, indexes and altmetrics providers. Each of these can deliver different information entities in the scholarly workflow, and can deliver files in different formats, in an open standard, with a CC0 license.

Delivery format for Publishers:
  • Article-level data, as a machine-readable article format:
    • Standard: JATS-XML (Journal Article Tag Suite); a parsing sketch follows below
    • Contains:
      • Header: metadata including title, abstract, keywords, authors, affiliations, author roles
      • Footer: including reference lists
      • Body: including sections with paragraphs, tables, figures
    • Scope: articles with Dutch authorship
  • Article-level COUNTER statistics:
    • Standard: article-level COUNTER-XML
    • Contains: categorised statistics (view/download/deny/etc.) on different parts of the article
    • Scope: worldwide usage of articles with Dutch authorship
Delivery format for Index providers (Scopus, WoS, Dimensions, etc.):
  • Standard: the full record, including (when applicable) keywords, abstracts, reference lists, etc., in JSON
  • Contains: articles, grants, patents, clinical guidelines, etc.
  • Scope: entities with Dutch contributorship
Delivery format for Altmetrics providers (Altmetric, PlumX, etc.):
  • Mentions of all types (news, policy documents, trials, social media, etc.) in JSON
  • Scope: all mentions mentioning Dutch research output (publications/datasets/etc.)
Delivery format for Universities (CRIS) and Funders:
  • Standard: CERIF-XML OpenAIRE-GL
  • Contains: information about organisational units, researchers, projects, publications, datasets, awarded grants, etc.

All information entities need to be delivered as individual files, in a zipped package. That package must be logically aggregated and deposited, e.g. by year or month. Provenance metadata about the source providing the data, and an open licence, need to be added. The deposit also needs descriptive metadata, including pointers to the open standard of the data files, to adhere to the FAIR principles: https://www.go-fair.org/fair-principles/
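
To make the JATS delivery format above concrete, here is a minimal sketch of what a service provider could do with a delivered file: pull the title, author surnames and abstract out of the article header with Python’s standard library. The element paths follow the typical JATS layout; real files vary, and the file name is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET

# Parse one delivered JATS-XML article file (hypothetical file name).
tree = ET.parse("article.xml")

# In a typical JATS layout, header metadata lives under front/article-meta.
meta = tree.find("./front/article-meta")
if meta is None:
    raise SystemExit("Not a JATS article: no front/article-meta element found")

title = meta.findtext("./title-group/article-title", default="")
abstract = meta.findtext("./abstract/p", default="")
surnames = [s.text for s in meta.findall("./contrib-group/contrib/name/surname")]

print(title)
print(surnames)
print(abstract[:200])
```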

Service providers can then download the data from the OKB and, for example, fill a search index with that information. This can then be used, for instance, to enrich the metadata of the Dutch CRIS systems.

SOLL: the open knowledge base of the future

To stay true to the 5-star linked open data mindset, this OKB is an interconnected, distributed network of data entities, where access and connectivity are maintained by each owner of the data nodes. Those node owners can be publishers, funders or universities. They can independently make claims, or assertions, about the identifiers of the content types they maintain.

For example, publishers maintain identifiers for publications, universities for affiliations of researchers, ORCID for researchers, funders for funded projects, etc. This interconnectivity is gained by the fact that, firstly, node owners can make claims about their content types in relation to the content types of other node owners. For example, a publisher can make the assertion that this publication was made with funds from that funded project, independently from the funder itself.

This stays true to the old days of the internet, where everyone could make their own web page and link to others, without bi-directional approval. Secondly, assertions are made using entities and relationships defined by the linked open data cloud. This assures interoperability in a way that “machines” understand on a semantic level, so that they can use the concepts in their internal processes. For example, a data field called ‘name’ could be interpreted by one machine as the first name of a person, while another machine interprets it as the name of an organisation or an instrument. Using the ontologies of the linked open data cloud pins the exact semantic meaning to the field ‘name’.

To keep track of who made what assertion, provenance information is added. This way services are able to weigh assertions from one node owner differently than those from another. (More about that in Nanopublications: www.nanopub.org)

Zooming out, we see the OKB, connected with the linked data cloud, as a “knowledge representation graph” that has numerous applications in answering the most complex research questions.

Movement-building from home, a participant view

Authors: Esther Plomp, Lena Karvovskaya, Yasemin Turkyilmaz – van der Velden

From the 14th of April until the 7th of May, the Mozilla Foundation ran “Movement-building from home”, a series of online meetings. The topic of these meetings was activism, community building, and maintenance in the special circumstances around COVID-19. Below follows a summary of some of the key points from these meetings, along with some resources that were brought together by the participants.

Throughout these calls, it was inspiring to hear about the ways that people deal with the new situations caused by COVID-19. Everyone is experiencing similar challenges but shows the remarkable ability to adapt to these changes, and we felt connected through our compassion and understanding during these unusual times. 

Set up

The sessions were hosted by Abigail Cabunoc Mayes and Chad Sansing from Mozilla Foundation. There were four sessions per week to enable people to join at their preferred time. The calls were open to anyone interested in online community and movement building and sharing experiences. The notes and recordings are available online:

Each session started with a check-in where participants wrote some information about themselves in a collaborative Google document, as well as their expectations of the call. After the check-in, the discussion topic of the week was introduced by Abby, as well as the functionalities of the tools used (Google Docs, Zoom). This was followed by some expectations that Abby and Chad had of the participants of the calls. To facilitate an inclusive and accommodating environment, we were referred to the Community Participation Guidelines. Issues could be reported to either Abby and/or Chad.

Once a secure environment was established, the goals of the call were outlined based on the topic of the week. After this introduction, the participants got to contribute their experiences on the topic. Abby and Chad summarised the experiences and added their comments to the document. In the next part, Abby and Chad introduced the content that they had prepared and answered questions. Every call included break-out rooms (2-3 people) where participants could have more intimate discussions related to the topic of the meeting. Finally, reflections and takeaway points from these break-out discussions were summarised, and participants were directed to other resources and means to stay in touch with the community.

There is more to collaboration than you might think! Turing Way / Scriberia

Week 1: Online Meetings

The first week focussed on our positive and negative experiences with online meetings. The participants listed some successes and challenges:

See here for text format

To host a successful online meeting, you should first choose an accessible platform that meets the needs of your community in terms of privacy and safety (see some examples of platforms here). It should be clear what participants require from the call, and you should follow up with anyone that could not attend the meeting. You should be explicit about the types of contributions you expect from participants, such as note-taking, facilitating the discussion or keeping time. It is good to allow for asynchronous contribution through a collaborative note-taking document to make space for questions as well as contributions from anyone that could not attend. You should document your meeting through e.g., a recording, captioning, or a summary. To facilitate more interaction, participants can be split up into smaller groups using break-out rooms. When your meeting has ended, it should be clear what the next actions are, and how participants can stay in touch with you and each other.

Week 2: Community Care

The second week focussed on community care, which was defined as:

all of the ways in which you show attention to and care for your community members across different dimensions of accessibility, equity, and inclusions, from caring meeting times to compensation to hitting pause when things go wrong to take care of people first, etc.

Community care is basically any care provided by an individual to benefit other people in their life. The participants listed some successes and challenges:

See here for text format

Here are the take-home messages from this call:

  • Ensure belonging by MIMI (make it more inviting), set up enough structure to provide a clear purpose, while maintaining enough flexibility to care for each other, and people’s safety and privacy. 
  • Repeating foundational practices such as the Community Practice Guidelines while checking-in with the community members, and showing gratitude and recognition.
  • Flexibility and prioritization for adjusting to the new norms. Which elements must you sustain, and what can be de-emphasized to reduce overwhelm?
  • Assessing needs, especially those around privacy and security and communicating risks involved with various platforms.
  • Being prepared about how to disagree. Taking an increased response time to overcome fear-driven defensiveness and sharing key information and gathering responses ahead of time to limit surprises.
  • Careful and caring moderation. Generating new communication channels when necessary while avoiding duplication/overload. 
  • Reframing professional development & training by asking what people need to do, by offering training on not only new online tools and risks involved but also on new life and work balance demands. Using collaboration and mentorship to show care and build capacity for continuity.
  • Opting-in social time to help members to feel belonging to their community by doing lightweight prompts such as google street map tours of hometowns, pet parades and virtual play dates.
  • Expect to make mistakes and rehearse taking responsibility and moving forward from them.
  • Ensuring sustainability by re-assessing roles, responsibilities, and contribution pathways, identifying what matters most to continue online, and scanning for funding opportunities. 

Week 3: Personal Ecology

Personal Ecology is a term that is not well known outside of Mozilla’s community. It refers to self-care in a wide sense of the word: things one does to stay happy, healthy, and engaged with one’s work.

Personal ecology means “To maintain balance, pacing and efficiency to sustain our energy over a lifetime.” – Rockwood Leadership Institute, Art of Leadership

At the beginning of the meeting, some prompts were offered to the participants:

See here for text format

The big idea behind personal ecology is that taking care of oneself is among the responsibilities of an activist, leader or community manager. Self-care must be strategic: it requires intent, caring, and frequent self-assessment, as well as support from others.

“You can’t sustain a movement if you don’t sustain yourself.” – Akaya Windwood

The crucial part of this call was a self-care assessment. The participants were invited to make a copy of the inventory prompts below. Ten minutes were devoted to ranking one’s response to each item from 1 (never) to 5 (always).

  • I have time to play in ways that refresh and renew me.
  • I am energized and ready to go at the start of my day.
  • I regularly get a good night’s sleep.
  • I effectively notice and manage stress as it arises.
  • I can execute my current workload with ease and consistency.
  • I have time to daydream and reflect.
  • During the day I take time to notice when I’m hungry, tired, need a break, or other physical needs.
  • I periodically renew my energy through the day, every day. 
  • I eat food that satisfies me and sustains my energy throughout the day.
  • I often have ways to express my creativity.
  • I have time to enjoy my hobbies.
  • Those that love and care about me are happy with my life’s balance. 
  • I spend enough time with family and friends.
  • I take time to participate in fun activities with others.
  • I feel connected to and aware of my body’s needs.
  • I take time to pause and reset now and again.
  • I am satisfied with my balance of solitude and engagement with others.
  • I make time for joy and connection.
  • I feel at peace.
  • At the end of my day I am content and ready to sleep.

After the ranking was done, the participants were invited to make lists for themselves of:

  • Things to continue.   
  • Things to improve or increase.
  • Things to try or work towards.

The meeting was completed with everyone writing down one powerful next step they will take.

Week 4: Community management

In the fourth week we were asked about the successes and challenges we had experienced in community management. Several examples of successful online communities were listed by participants: Mozilla Open Leaders (including its “daughters”: Open Life Science, eLife Innovation, Open Post Academics, and OpenScapes), the Carpentries, the Software Sustainability Institute, rOpenSci, The Turing Way, The Athenas, and the Center for Scientific Collaboration and Community Engagement (CSCCE).

See here for text format

In a time of crisis, such as during COVID-19, a community manager should give hope and be empathetic, but also be realistic and transparent about the situation. Abby introduced a community management principle: the Mountain of Engagement. A sustainable community needs two things: 1) new members and 2) a way for existing members to grow within the community. These two things involve different levels of engagement (on the Mountain). First there is the ‘discovery’ level, where members first hear about the community. Then there is the ‘first contact’ level, where they first engage with the community. After first contact, new members can contribute to a community in the ‘participation’ phase. When this contribution continues, they reach the ‘sustained participation’ level. They may also use the community as a network (‘networked participation’) and eventually take on more responsibilities in the project at the ‘leadership’ level. It is good practice to consider how you will engage your members through these various levels from the start of your project or community. Your members will have different requirements and needs, depending on which level they are at:

  1. Discovery; where the promotion of your community is important, which can be done through having a public repository that has an open license so that it is clear for others what they can reuse. 
  2. First contact: your community needs to have a clear mission, and multiple communication channels to make it easy for people to get in touch. This includes offering some passive channels which allow them to just follow the community. 
  3. Participation: Personal invitations to contribute work best. In these invitations you should set clear expectations by having contributing guidelines and a code of conduct (to which members can contribute). It is also good practice to let your participants know how much time and effort is expected from them if they want to contribute. By allowing your members to contribute on their own terms, you allow them to take ownership of their contributions.
  4. Sustained Participation: It is important to recognise the contributions of your community members, as well as to allow their skills and interests to move the community forward, in line with the community mission.
  5. Networked Participation: Your community should be open to mentorship and training possibilities to allow members to grow. You can also think about professional development and offer certificates to members. 
  6. Leadership: Leadership should be inclusive, and involve value exchanges. It should be clear what is expected of community members when they take responsibilities. Leadership can take many forms and can come from anyone within the community.
The community network as visualised by Turing Way / Scriberia

It is also important to recognise that your community members can move up and down these levels of the Mountain of Engagement. Sometimes they will even need to depart and come back to your community at another time. To help move members forward it is important to assign them time-bound and specific tasks in accordance with their capacities, and to recognise their contributions. Not everyone in your community needs to contribute and engage at every opportunity.

In a time of crisis, it can also be important to focus on the things that really matter right now, rather than overburden your community members. Here it is important to ask your community members about their needs and preferences. You can, for example, consult them on their preferences for communication platforms in order to meet them where they are. Patience and reflection are valuable in these situations, as they allow us to think more deeply about why we work in certain ways and what we can learn from working online. It is important to realise that anything we build up now can also be used when the time of crisis is over!

Contributions by

Grant R. Vousden-Dishington / Mario García / Cornelius Kibelka / Kim Cressman / Ryan Pitts / Julien Brun / Eirini Zormpa / Di / Jennifer Polk / Tim Butcher / Jen Hernandez-Munoz / Barbara Paes / Anisha Fernando / Terra Graziani / Elio Campitelli / Kristen Thorp / Kevin Mulhern / Erin Dunigan / Cora Johnston / Meg O’Hearn / Emily Stovel / Una Lee / Marty Downs / Gabriela Mejias / Jason Heppler / Brandon Locke / Julie Lowndes / Debra Erickson / Chad Walker / Patricia Herterich / Bradly Alicea / Bhuvana Meenakshi / EN,Hi,Ta / Chiara Bertipaglia / Gayle Schechter / Oscar van Vliet / Lisa Bass / Samantha Teplitzky / Kevin Helfer / Stavana Strutz / Jessica Steelman / Hilary Ross / Aliya Reich / Carrie Kappel / Elizabeth Blackburn / Sarah Melton / Jodi Reeves Eyre / Daphne Ugarte / Verena Lindner / Zannah Marsh / Marilyn Pratt  / Kimani Nyoike/Maskani Ya Taifa / Joppe Hoekstra / Edoardo Viola / Rachael Ainsworth / Lucy Patterson / Merle von Wittich / Grace McPherson / Sara El-Gebali / Lucia / Naomi Alexander Naidoo / Gavin Fay / Kim Wilkens / Alan Berkowitz / Vinodh Ilangovan / Marijn / Jez Cope / Christina Rupprecht / Teo Comet / John Cummings / Hanan Elmasu / Harshil Agrawal / Brenda Hernandez / Christina Cantrill / Lis Sylvan / Dylan Roskams-Edris / Kate Nicholson / Maartje Eigeman / Dave Howcroft / Francesca Minelli / Brooke Brod / Steve Van Tuyl / Sharan Jaswal / Nicole Holgate / Elisabeth Sylvan / Anna Desponds / Emma Irwin / Konstantina / Daniel Sestrajcic / Camille Maumet / Mohammad Issa / Cassandra Gould van Praag / Sadik Shahadu / Rubén Martín / Ana Rosa Rizo-Centino / Kartik Choudhary / Erin Robinson / Sarah Dorman / Carla Garcia Z. / Noha Abdel Baky / Elizabeth Sarjeant / Leslie Hsu / Suzi Grishpul / Philo van Kemenade / Ioana Chiorean / Raven Sia / Jaana Pinheiro / Angela Li / Lewis Munyi / Chana Fitton / Callan Bignoli / Jona Azizaj / Malvika Sharan / Annette Brickle / Edwin / Anita Cheria / Eileen McNulty-Holmes / Pablo Chamorro / Emilie Socash / Trevor Haché / Lynn Gidluck / Izzy Czerveniak / Lara Therrien Boulos /