The main obstacles to better research data management and sharing are cultural. But change is in our hands


This blog post was originally published by the LSE Impact Blog.


Recommendations on how to better support researchers in good data management and sharing practices are typically focused on developing new tools or improving infrastructure. Yet research shows the most common obstacles are actually cultural, not technological. Marta Teperek and Alastair Dunning outline how appointing data stewards and data champions can be key to improving research data management through positive cultural change.

This blog post is a summary of Marta Teperek’s presentation at today’s Better Science through Better Data 2018 event.


By now, it’s probably difficult to find a researcher who hasn’t heard of journal requirements for sharing research data supporting publications. Or a researcher who hasn’t heard of funder requirements for data management plans. Or of institutional policies for data management and sharing. That’s a lot of requirements! Especially considering data management is just one set of guidelines researchers need to comply with (on top of doing their own competitive research, of course).

All of these requirements are in place for good reasons. Those who are familiar with the research reproducibility crisis and understand that missing data and code is one of the main reasons for it need no convincing of this. Still, complying with the various data policies is not easy; it requires time and effort from researchers. And not all researchers have the knowledge and skills to professionally manage and share their research data. Some might even wonder what exactly their research data is (or how to find it).

Therefore, it is crucial for institutions to provide their researchers with a helping hand in meeting these policy requirements. This is also important in ensuring policies are actually adhered to and aren’t allowed to become dry documents which demonstrate institutional compliance and goodwill but are of no actual consequence to day-to-day research practice.

The main obstacles to data management and sharing are cultural

But how best to support researchers in good data management and sharing practices? The typical answers to this question are “let’s build some new tools” or “let’s improve our infrastructure”. When thinking about how to provide data management support to researchers at Delft University of Technology (TU Delft), we decided to resist this initial temptation and do some research first.

Several surveys asking researchers about barriers to data sharing indicated that the main obstacles are cultural, not technological. For example, in a recent survey by Houtkoop et al. (2018), psychology researchers were given a list of 15 different barriers to data sharing and asked which ones they agreed with. The top three reasons preventing researchers from sharing their data were:

  1. “Sharing data is not a common practice in my field.”
  2. “I prefer to share data upon request.”
  3. “Preparing data is too time-consuming.”

Interestingly, the only two technological barriers – “My dataset is too big” and “There is no suitable repository to share my data” – were among the three at the very bottom of the list. Similar observations can be made based on survey results from Van den Eynden et al. (2016) (life sciences, social sciences, and humanities disciplines) and Johnson et al. (2016) (all disciplines).

At TU Delft, we already have infrastructure and tools for data management in place. The ICT department provides safe storage solutions for data (with regular backups at different locations), while the library offers dedicated support and templates for data management plans and hosts the 4TU.Centre for Research Data, a certified and trusted archive for research data. In addition, dedicated funds are made available for researchers wishing to deposit their data into the archive. This being the case, we thought researchers might already be receiving adequate data management support and that no additional resources were required.

To test this, we conducted a survey among the research community at TU Delft. To our surprise, the results indicated that despite all the services and tools already available to support researchers in data management and sharing activities, their practices needed improvement. For example, only around 40% of researchers at TU Delft backed up their data automatically. This was striking, given the fact that all data storage solutions offered by TU Delft ICT are automatically backed up. Responses to open questions provided some explanation for this:

  • “People don’t tell us anything, we don’t know the options, we just do it ourselves.”
  • “I think data management support, if it exists, is not well-known among the researchers.”
  • “I think I miss out on a lot of possibilities within the university that I have not heard of. There is too much sparsely distributed information available and one needs to search for highly specific terminology to find manuals.”

It turns out, again, that the main obstacles preventing people from using existing institutional tools and infrastructure are cultural – data management is not embedded in researchers’ everyday practice.

How to change data management culture?

We believe the best way to help researchers improve data management practices is to invest in people. We have therefore initiated the Data Stewardship project at TU Delft. We appointed dedicated, subject-specific data stewards in each faculty at TU Delft. To ensure the support offered by the data stewards is relevant and specific to the actual problems encountered by researchers, data stewards have (at least) a PhD qualification (or equivalent) in a subject area relevant to the faculty. We also reasoned that it was preferable to hire data stewards with a research background, as this allows them to better relate to researchers and their various pain points as they are likely to have similar experiences from their own research practice.

Vision for data stewardship

There are two main principles of this project. Crucially, the research must stay central. Data stewards are not there to educate researchers on how to do research, but to understand their research processes and workflows and help identify small, incremental improvements in their daily data management practices.

Consequently, data stewards act as consultants, not as police (the objective of the project is to improve cultures, not compliance). The main role of the data stewards is to talk with researchers: to act as the first contact point for any data-related questions researchers might have (be it storage solutions, tools for data management, data archiving options, data management plans, advice on data sharing, budgeting for data management in grant proposals, etc.).

Data stewards should be able to answer around 80% of questions. For the remaining 20%, they ask internal or external experts for advice. But most importantly, researchers no longer need to wonder where to look for answers or who to speak with – they have a dedicated, local contact point for any questions they might have.

Data Champions are leading the way

So has the cultural change happened? This is, and most probably always will be, a work in progress. However, allowing data stewards to get to know their research communities has already had a major positive effect. They were able to identify researchers who are particularly interested in data management and sharing issues. Inspired by the University of Cambridge initiative, we asked these researchers if they would like to become Data Champions – local advocates for good data management and sharing practices. To our surprise, more than 20 researchers have already volunteered as Data Champions, and this number is steadily growing. Having Data Champions team up with the data stewards allows us to incorporate peer-to-peer learning strategies into our data management programme and also offers the possibility of creating tailored data management workflows, specific to individual research groups.

Technology or people?

Our case at TU Delft might be quite special, as we were privileged to already have the infrastructure and tools in place which allowed us to focus our resources on investing in the right people. At other institutions circumstances may be different. Nonetheless, it’s always worth keeping in mind that even the best tools and infrastructures, without the right people to support them (and to communicate about them!), may fail to be widely adopted by the research community.

TU Delft’s presence at the International Data Week 2018 in Botswana


Yasemin Turkyilmaz-van der Velden and Marta Teperek are very privileged to represent TU Delft at the International Data Week 2018 in Gaborone, Botswana. Yasemin has been awarded a very competitive grant for Early Career researchers to attend the conference.

We are working hard to make sure that we get the most out of our attendance. Yasemin is presenting:

Marta’s contribution:

The full programme of the International Data Week can be accessed online.

 

 

Workshop Report: Software Reproducibility – The Nuts and Bolts

Authors (in alphabetical order): Maria Cruz (VU), Marc Galland (UvA), Carlos Martinez (NL eScience Center), Raúl Ortiz (TU Delft), Esther Plomp (VU), Anita Schürch (UMCU), Yasemin Türkyilmaz-van der Velden (TU Delft)


Based on the contributions from workshop participants (in alphabetical order): Joke Bakker (University of Groningen), Jochem Bijlard (The Hyve), Mattias de Hollander (NIOO-KNAW), Joep de Ligt (UMCU), Albert Gerritsen (Radboud UMC), Thierry Janssens (RIVM), Victor Koppejan (TU Delft), Brett Olivier (Vrije Universiteit Amsterdam), Raúl Ortiz (TU Delft), Esther Plomp (Vrije Universiteit Amsterdam), Jorrit Posthuma (ENPICOM), Anita Schürch (UMCU)


On 2 October 2018, Maria Cruz (VU), Marc Galland (UvA), Carlos Martinez (NL eScience Center), and Yasemin Türkyilmaz-van der Velden (TU Delft) facilitated a workshop titled “Software Reproducibility – The Nuts and Bolts”, as part of the DTL Communities@Work 2018 event held in Utrecht, the Netherlands.

Besides the four organisers, there were 24 workshop participants, including researchers, research software engineers/developers, data stewards and others in research support roles.

Below we summarise the background and rationale for the workshop, key discussions and insights, and recommendations. The description of the workshop setup, including information about the participants gathered via Mentimeter, can be found at the end of this report.

The listed authors include the four organisers and the workshop participants who actively contributed to the report. Workshop participants who agreed to be acknowledged for their contributions are also listed.

Rationale for the workshop

The starting point for the workshop was a paper published in Water Resources Research by Hut, van de Giesen and Drost (2017), which argues that carefully documenting and archiving code and research data may not be enough to guarantee the reproducibility of computational results. Alongside the use of the current best practices in scientific software development, these authors recommend close collaboration between scientists and research software engineers (RSEs) to ensure scientists are aware of the latest computational advances, most notably the use of containers (e.g. Docker) and open interfaces.

As in a similar workshop previously held at TU Delft on 24 May 2018, the participants discussed the merits of these recommendations and how they could be put into practice, as well as what role the various stakeholders (researchers, research software engineers, research institutions, data stewards and other research support staff) could play in this regard.

In this second edition of the workshop, the participants also made recommendations for actions that could be taken at the national level in the Netherlands to raise awareness of software sustainability and reproducibility and to implement the advice from the paper and the workshop. The key discussion points and insights from these discussions and the ensuing recommendations are summarised below, based on information recorded during the workshop in a collaborative Google document.


In this report we define the reproducibility and reusability of software as follows. Reproducibility is focused on being able to reproduce results obtained in the past – that is, using the same data and the same software to reach the same result (a Docker image may be good enough for this). Reusability is concerned with using the software in a different context than it was used in before; this could be as simple as using the same software with different data, or it may require modification of the original software (Docker images may or may not be sufficient for software reusability).
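To make this distinction concrete, here is a minimal, purely hypothetical sketch (not code discussed at the workshop; the file and column names are invented): a parameterised function that is reusable with any dataset, alongside a hard-wired wrapper that only reproduces one past result.

```python
# Hypothetical illustration of reusable vs merely reproducible code.
import csv
import statistics


def mean_of_column(path: str, column: str) -> float:
    """Reusable: works for any CSV file and any numeric column."""
    with open(path, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return statistics.mean(values)


def reproduce_published_result() -> float:
    """Reproducible: re-runs the original analysis on its archived,
    hard-wired input (an invented file name, for illustration only)."""
    return mean_of_column("archived/discharge_2017.csv", "discharge_m3s")


if __name__ == "__main__":
    print(reproduce_published_result())
```

Freezing the environment around the second function (for example in a container image) may be enough for reproducibility; reuse in a new context depends on the first, parameterised form and may still require changes to the code itself.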

Key discussion points and insights on the advice by Hut, van de Giesen and Drost

Sound but too technical advice

Overall, the groups felt that the advice was sound but too technically focussed, particularly if it is aimed at researchers. Researchers should not need to concern themselves with containers and open APIs, which are too technical to implement. The advice also fails to consider and recognise deeper cultural issues, such as: the lack of awareness on the topic of reproducibility and reusability of research software; the lack of relevant training, tools and support; and the diversity of code.

Concerns regarding the use of containers

Docker may not necessarily be easy to use if you are not a software developer or research software engineer. There was also the concern that containers, although helpful, should not be used to mask bad coding practices. The use of containers also makes it difficult to upgrade the software. Containers make it easier to distribute software in the short term, but to make software sustainable someone needs to understand how to update and build a new container. This is a role for the research software engineers, not the researchers, as there are no easy-to-use tools that allow for the re-use of software in different containers. The other issue that was raised was whether Docker and other platforms would still exist in 20 years’ time.

Not all code is equal

Not all software is meant to be maintained or reused. High software quality, version management, code review, etc. will all help with reproducibility and reusability, but at some point in time the software might not be sustainable anymore. Is this necessarily bad? Code from 10 years ago will probably need to be rewritten in newer languages. Defining the scope of code will help determine the level of reproducibility and reusability requirements. In particular, it is important to differentiate between single-use scripts and pipelines that are used repeatedly and/or by different people. While the former do not need to be highly maintained, the latter need to be extensively reviewed and tested. Commercial software is also an issue. In some fields of research, many scientists use Excel or MATLAB. Commercial software is often closed source, making it difficult to test, review and publish; and sometimes publication of the code is not possible for IP or confidentiality reasons.

Training and raising awareness

How much are researchers aware of the reproducibility crisis? Researchers need to be aware of the key features and concepts behind reproducibility and reusability of research software. These concepts are more important than any particular techniques. The first step should thus be raising awareness of these issues. People who are already aware of the reproducibility crisis and of practices conducive to reproducibility and their practical benefits have a responsibility to raise awareness within their department/group/colleagues. Researchers also need to be aware of the possibilities and best practices in order to apply them. Training is important in this regard. Having the right tools and support is also essential. Researchers need to know who to contact for help and support and how to find the right tools.

Code review

There should be code review sessions involving all the interested parties. Code review could be similar to peer review and be done at the institutional or departmental level. Working together on software increases the quality of code, particularly if it is reviewed by multiple stakeholders. Sharing the experience and the knowledge gained from these code review sessions more widely would provide a way to advertise and advocate for the best practices in software development.

Community building

Building a community behind a particular tool or piece of software was also seen as a good way to ensure that code is maintained and upgraded. If the software is out there and there is interest in it, people will maintain it. Being part of such a community may not necessarily require specific expertise or technical involvement. A user of a tool can very well contribute to the community by raising issues without needing to have specific knowledge about the code.

Good practices in scientific software development

Good coding practices should be publicly available and widely advertised. Building software should start with clearly documented use cases, and these use cases should define the entry points for the code. Materials and methods should include parameters for any executable. The environment configuration should also be added alongside the code to make it reproducible. For software to be redeployable on different platforms (and over time), it needs to be well documented, including open data and workflows. You need to be able to understand what the purpose of the experiment was, how it was done, and how the data was processed, if that is relevant. Version control and releases with DOIs are also important. Testing with proper positive and negative controls, integration, and validation are also critical to re-using software.
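As a purely illustrative sketch of a few of these practices (not an example from the workshop; all names are invented), the script below exposes one documented entry point, takes its parameters explicitly so they can be reported alongside the results, and includes simple positive and negative test controls.

```python
# Hypothetical example: documented entry point, explicit parameters, basic tests.
import argparse


def normalise(values, lower=0.0, upper=1.0):
    """Scale a list of numbers linearly onto the interval [lower, upper]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot normalise a constant series")
    return [lower + (upper - lower) * (v - low) / (high - low) for v in values]


def test_normalise():
    # Positive control: a known input gives the expected output.
    assert normalise([1.0, 2.0, 3.0]) == [0.0, 0.5, 1.0]
    # Negative control: invalid input is rejected rather than silently accepted.
    try:
        normalise([5.0, 5.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for a constant series")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Normalise a series of numbers.")
    parser.add_argument("values", nargs="+", type=float, help="input numbers")
    parser.add_argument("--lower", type=float, default=0.0, help="lower bound of the output range")
    parser.add_argument("--upper", type=float, default=1.0, help="upper bound of the output range")
    args = parser.parse_args()
    print(normalise(args.values, args.lower, args.upper))
```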

The roles of data stewards, RSEs and researchers


RSEs as ambassadors for software reproducibility

While researchers should lead when it comes to reproducibility, data stewards could help raise awareness of this important issue and of the best practices for software reproducibility. RSEs, who often work in support roles standing between researchers and their software, have a key role to play as ambassadors and should be part of the driving force behind efforts towards software reproducibility. In particular, they should be creating and maintaining software development guidelines. Research support roles, including those of data stewards and RSEs, should be more clearly defined and rewarded; these roles should not be seen or performed as just a side activity. RSEs should be actively involved in the research design and publication process, and should not be seen solely as supporters of researchers, but as collaborators. Unfortunately, the current funding schemes do not reward these activities.

Cross-expertise speed-networking

Communication and interaction between the three key stakeholders (researchers, RSEs and data stewards) was seen as a shared responsibility. However, setting up cross-expertise speed-networking events could be an easy way to connect researchers, data stewards and RSEs, and to encourage collaboration. This type of initiative could be implemented at the institutional, national and/or even international level. At the institutional level, a central service desk could work as a hub to connect researchers to research support experts. Encouraging collaboration by helping researchers connect with available experts provides a way to avoid redundant solutions to similar problems. For collaborations to be fruitful, however, researchers need to understand the perspective of RSEs and data stewards, and vice versa. Domain-specificity is another barrier that can block the collaboration between data stewards, RSEs and researchers.

How to encourage reproducibility in computational research?

As noted earlier, researchers should lead when it comes to reproducibility. However, they may not always be interested in reproducibility, as reproducibility does not always guarantee good science. Researchers need to be intrinsically stimulated to document and review their code and to follow the best practices in software management and development. Publishing a methods or software paper that includes easy-to-reuse, high-quality software will help researchers get more citations. User-friendly tools that help with software management and reproducibility will also stimulate use by researchers.

Key recommendations

Reproducibility should be enforced from the top down

Journals and funders, in particular NWO, should enforce their policies. There should be funding for reproducibility; there should also be standards and requirements and appropriate audits. Data management plans as well as software sustainability plans are essential to ensure best practices. The funders need to become more aware of software sustainability and the need for software management. For FAIR data there are funding opportunities, but these are not available for FAIR software. There is a need to make good practices in science the de facto standard. FAIR (both for data and software) should be the rule and no longer the exception. There should also be more recognition for publishing data and code, not only papers.

A leading role for national platforms

National platforms, such as the Netherlands eScience Center, should also take responsibility and lead the research community in making software and data sustainability a recognised element of the research process. There is also a need among the research community for more knowledge and awareness about the NL eScience Center and the possibilities for collaborations between researchers and RSEs. In this respect, the Netherlands eScience Center should also take the lead in promoting collaboration between RSEs and researchers.

Community building as a bottom-up approach

Besides a top-down approach, building communities from the bottom up was also recommended as a way to connect researchers with relevant research support experts. The Dutch Techcentre for Life Sciences (DTL), for example, could set up a platform to connect individual researchers with software experts. This could be in the form of national cross-expertise speed-networking events or a forum. The NL-RSE initiative could also play a role in this regard and could help raise awareness of the issues around software reproducibility and sustainability.

Training

It is crucial to educate early career researchers, who have the time and interest. Courses and training are needed at the universities and at the national level. Researchers should be made aware of good practices for software development and software engineering at the earliest stages of their careers, including at the bachelor and master level.

Additional information

Workshop setup

The workshop session lasted two hours. It started with the organisers introducing themselves, followed by a short survey of the audience using Mentimeter, led by Yasemin Türkyilmaz-van der Velden. Maria Cruz then gave a presentation setting the scene, providing information on reproducibility and summarising the paper and the suggestions by Hut, van de Giesen & Drost (2017). Marc Galland gave a short presentation on software sustainability from the researcher’s point of view, and Carlos Martinez Ortiz gave his perspective on the same subject from the research software engineer’s point of view.

The audience was then split into four groups, with the organisers each joining a group to help facilitate the discussion. Each group was allotted 45 minutes to answer the following questions within a collaborative Google document:

  1. How can the advice by Hut, van de Giesen & Drost be put into practice?
  2. Any additional advice?
  3. How can researchers, RSEs, and data stewards work together towards implementing the advice?
  4. What needs to happen at the national level in the Netherlands to raise awareness of research software reproducibility and help implement the above or any of your ideas and recommendations?

About the participants

We asked the audience a few questions, using Mentimeter, to get familiar with their backgrounds and their experience with research software. As seen in the responses below, we had a mixed audience of researchers, research software engineers, data stewards, and people in other research support positions. As expected from a DTL conference, which focussed on the life sciences, most participants had a research background within this area, ranging from biomedical sciences to bioprocess engineering and plant breeding. All participants had experience with research software.

[Mentimeter results: participants’ roles, research backgrounds, and experience with research software]

Almost all participants agreed that there is a reproducibility crisis in science, reflecting the high level of awareness among the audience of this important issue. Before moving to the presentation about software reproducibility, we asked the participants what came to their mind about this topic. The answers, which ranged from version control, documentation and persistent identifiers to Git, containers, and Docker, clearly show that the audience was already very familiar with the topic of software reproducibility. In line with this, when we asked what they were doing themselves in terms of software reproducibility, we received very similar answers, with version control taking the lead among the answers to both questions.

[Mentimeter results: awareness of the reproducibility crisis and participants’ own software reproducibility practices]

Resources

Open Source Software Guidelines for Researchers

Written by Julie Beardsell and originally published on the ICT innovation blog.


Responding to the challenge

Navigating the often complex legal landscape of software licensing can be a genuine challenge for researchers, particularly when starting up a research project for the first time.

Today’s researchers, when starting out on a PhD, typically need to be competent scientists and programmers, but also need to understand software licensing well enough to make the right choices for the software that they build. Without the latter, they risk a number of potentially undesirable situations.

To help researchers navigate their way, a working group at TU Delft has put together a set of guidelines for researchers, which can be downloaded here.

In addition, the working group is drafting a document to provide more detailed information and links to related documents and useful sources.

Openness, reproducibility, peer review and building upon others’ work

The very nature of the research itself may be to create or improve software, which might be worked on openly and collaboratively with others from institutions other than the one by which the researcher is employed.

In addition, the task of creating scientific software as a research output does not end with the publication of the results generated with that software. Making the software available for inspection and use by other scientists is essential to reproducibility, peer review, and the ability to build upon others’ work.

Importance of licenses

Licenses are important for setting out the terms on which software may be used, modified, or distributed and by whom. Without a license agreement, software may be left in a state of legal uncertainty in which potential users may not know which limitations owners may want to enforce, and owners may leave themselves vulnerable to legal claims or have difficulty controlling how their work is used. Licenses can also be used to facilitate access to software as well as restrict it.
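As a small, hypothetical illustration (the project and copyright holder below are invented), a chosen licence can also be made explicit and machine-readable inside the code itself, in addition to shipping the full licence text in a LICENSE file at the repository root:

```python
# SPDX-License-Identifier: MIT
# Copyright (c) 2018 Example Research Group
#
# Hypothetical module header: the SPDX line records the chosen licence in a
# machine-readable way; the complete licence text lives in the repository's
# LICENSE file.

"""Analysis utilities for the (invented) windtunnel project."""

__license__ = "MIT"
__author__ = "Example Research Group"
```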

Working group

The working group consists of Julie Beardsell, Merlijn Bazuine, Susan Branchett, Maria Marques de Barros Cruz and Marta Teperek. The group would like to thank those researchers across the faculties who have contributed so far and encouraged the development of this initiative at TU Delft.

About the Author

Julie Beardsell is an innovation expert at TU Delft. Find her at TU Delft or LinkedIn.

“Open Source Software Guidelines for Researchers” by Julie Beardsell is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

User Statistics, 4TU.ResearchData, 2011-2018

In February 2011, we started tracking Google Analytics statistics for the 4TU.ResearchData data archive (https://data.4tu.nl). Today there are 7,936 datasets to view in our archive. This tracking has produced the following results.

The graph below shows how many users (users who started at least one session in the period) visited our archive each year. This was measured up to 19 October 2018, which means that the 2018 results are not yet complete.

The following graph shows how many unique page views took place per year. ‘Unique page views’ counts the number of sessions during which the specified page was viewed at least once; a unique page view is counted for each combination of page URL and page title. As above, the 2018 figures are not yet complete.
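For readers curious how such yearly figures could be assembled, the sketch below is a minimal, hypothetical example (not the actual 4TU.ResearchData workflow): it assumes the analytics data has been exported to a CSV file with date, users and unique_pageviews columns, and aggregates it by year.

```python
import pandas as pd

# Hypothetical export of the analytics data: one row per day with the columns
# date, users and unique_pageviews.
daily = pd.read_csv("ga_export.csv", parse_dates=["date"])
daily["year"] = daily["date"].dt.year

# One row per year; note that summing daily user counts only approximates the
# number of unique yearly users, and 2018 was still incomplete when measured.
yearly = daily.groupby("year")[["users", "unique_pageviews"]].sum()
print(yearly)
```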

Our top 20 most-viewed datasets are as follows:

  1. Real-life event logs – Hospital log – Eindhoven University of Technology
  2. BPI Challenge 2012 – Eindhoven University of Technology
  3. Receipt phase of an environmental permit application process (‘WABO’), CoSeLoG project – Eindhoven University of Technology
  4. BPI Challenge 2017 – Eindhoven University of Technology
  5. Road Traffic Fine Management Process – Eindhoven University of Technology
  6. BPI Challenge 2015 – Eindhoven University of Technology
  7. BPI Challenge 2014 – Rabobank Nederland
  8. Large Bank Transaction Process – Universitat Politècnica de Catalunya (Barcelonatech)
  9. BPI Challenge 2016 – UWV
  10. BPI Challenge 2015 Municipality 1 – Eindhoven University of Technology
  11. IDRA weather radar measurements – all data – TU Delft, Faculty of Civil Engineering and Geosciences
  12. Production Analysis with Process Mining Technology – NooL – Integrating People & Solutions
  13. BPI Challenge 2013 – Volvo IT
  14. Environmental permit application process (‘WABO’), CoSeLoG project – Eindhoven University of Technology
  15. Activities of daily living of several individuals – Universitat Politècnica de Catalunya, Barcelona, Spain
  16. Signatures of Majorana fermions in hybrid superconductor-semiconductor nanowire devices – TU Delft
  17. Sepsis Cases – Event Log – Eindhoven University of Technology, Department of Mathematics and Computer Science
  18. CFD in drinking water treatment – TU Delft, Faculty of Civil Engineering and Geosciences, Department of Water Management
  19. BPI Challenge 2017 – Offer log – Eindhoven University of Technology
  20. Loan application example – Eindhoven University of Technology

The world map below shows how many sessions there have been per country. A session is the period in which a user is active on the website.

The top 20 countries from which our visitors come, in order of number of users:

  1. Netherlands
  2. United States
  3. Germany
  4. China
  5. India
  6. Italy
  7. United Kingdom
  8. France
  9. Brazil
  10. Spain
  11. South Korea
  12. Australia
  13. Belgium
  14. Austria
  15. Poland
  16. Iran
  17. Russia
  18. Japan
  19. Canada
  20. Indonesia
All our graphs clearly show that we continue to grow every year, a result we are proud of. The full data will be published at the start of 2019.