On 27 and 28 February 2019, I attended the NSF FAIR Hackathon Workshop for Mathematics and the Physical Sciences research communities held in Alexandria, Virginia, USA. I travelled to the event at the invitation of the TU Delft Data Stewards and with the generous support of the Hackathon organisers, Natalie Meyers and Mike Hildreth from the University of Notre Dame.
Participants were encouraged to register and assemble as duos of researchers and/or students along with a data scientist and/or research data librarian. I was invited, as a data librarian with a research background in the physical sciences, to form a duo with Joseph Weston, a theoretical physicist by background and a scientific software developer at TU Delft, who is also one of the TU Delft Data Champions.
I presented about the Hackathon at the last TU Delft Data Champions meeting. The presentation is available via Zenodo. All the presentations and materials from the FAIR Hackathon are also publicly available. The FAIR data principles are defined and explained here. This blog post aims to offer some of my views and reflections on the workshop, as an addition to the presentation I gave at the Data Champions meeting on 21 May 2019.
The grand vision of FAIR
The workshop’s keynote presentation, given by George Strawn, was one of the highlights of the event for me. His talk set out clearly and authoritatively the vision behind FAIR and the challenges ahead. Strawn’s words still ring in my head: “FAIR data may bring a revolution on the same magnitude as the science revolution of the 17th century, by enabling reuse of all science outputs – not just publications.” Drawing parallels between the development of the internet and FAIR data, Strawn explained: “The internet solved the interoperability of heterogeneous networks problem. FAIR data’s aspiration is to solve the interoperability of heterogeneous data problem.” Just as the internet effectively created one computer (“the network is the computer”), FAIR will effectively create one dataset. FAIR data will be as much a core infrastructure as the internet is today.
“The internet solved the interoperability of heterogeneous networks problem. FAIR data’s aspiration is to solve the interoperability of heterogeneous data problem.” — George Strawn
Strawn warned that it isn’t going to be easy. The challenge of FAIR data is ten times harder to solve than that of the internet, intellectually, and it will have to be met with fewer resources. Strawn has strong credentials and a solid track record in this matter. He was part of the team that transitioned the experimental ARPAnet (the precursor to today’s internet) into the global internet, and he is part of the global efforts to bring about an Internet of FAIR Data and Services. In his view, “scientific revolution will come because of FAIR data, but likely not in a couple of years but in a couple of decades.”
Researchers do not know about FAIR
Strawn referred mainly to technical and political challenges in his presentation. One of the challenges I encounter in my daily job as a research data community manager is not technical in nature, but cultural and sociological: how do we get researchers engaged with FAIR data, and how do we make them enthusiastic about the road ahead? Many researchers are not aware of the FAIR principles, and those who are do not always understand how to put the principles into practice, or are not always willing to do so. As reported in a recent news item in Nature Index, the 2018 State of Open Data report, published by Digital Science, found that just 15% of researchers were “familiar with FAIR principles”. Of the respondents who were familiar with FAIR, only about a third said that their data management practices were very compliant with the principles.
The workshop tried to address this particular challenge by bringing together researchers in the physical sciences, experts in data curation and data analysts, FAIR service providers and FAIR experts. About half of the participants were researchers, mainly in the areas of experimental high energy physics, chemistry, and materials science research, at different stages in their careers. Most were based in the US and funded by NSF.
These researchers were knowledgeable about data management and for the most part familiar with the FAIR principles. However, the answers to a questionnaire sent to all participants in preparation for the Hackathon show that even a very knowledgeable and interested group such as this one struggled with detailed questions about the FAIR principles. For example, when asked specific questions about provenance metadata and ontologies and/or vocabularies, many respondents answered that they didn’t know. As highlighted in the 2018 State of Open Data report, interoperability, and to a lesser extent re-usability, are the least understood of the FAIR principles. Interoperability, in particular, is the one that causes most confusion.
There were many opportunities during the workshop to exchange ideas with the other participants and to learn from each other. There was much optimism and enthusiasm among the participants, but also some words of caution, especially from those who are trying to apply the FAIR principles in practice. The PubChem use case “Making Data Interoperable”, presented by Evan Bolton from the U.S. National Center for Biotechnology Information, was a case in point. It could be said, as noted by one of the participants, that the chemists “seem to really have their house in order” when it comes to metadata standards. Not all communities have such standards. However, when it comes to “teaching chemistry to computers” – or, put in other words, making it possible for datasets to be interrogated automatically, as intended by the FAIR principles – Bolton’s closing slide struck a more pessimistic note. “Annotating and FAIR-ifying scientific content can be difficult to navigate”, Bolton noted, and it can feel like tilting at windmills. “Everything [is] a work in-progress” and “what you can do today may be different from tomorrow”.
What can individual researchers do?
If service providers, such as PubChem, are struggling, what are individual researchers to do? The best and most practical thing a researcher can do is to obtain a persistent identifier (e.g. a DOI) by uploading data to a trusted repository such as the 4TU.Centre for Research Data archive, hosted at TU Delft, or a more general archive such as Zenodo. This will make datasets at the very least Findable and Accessible. Zenodo conveniently lists on its website how it helps datasets comply with the FAIR principles. The 4TU.Centre for Research Data, and many other repositories, offer similar services when it comes to helping make data FAIR.
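A deposit only becomes Findable if it carries a minimum of descriptive metadata alongside its persistent identifier. As a purely illustrative sketch, the check below validates that a deposit record is complete before submission; the field names loosely follow Zenodo's deposit metadata schema, and the dataset itself is invented:

```python
# Hypothetical sketch: check that a dataset deposit carries the minimal
# metadata that makes it Findable. Field names loosely follow Zenodo's
# deposit metadata schema; the real schema may differ.

REQUIRED_FIELDS = ["title", "upload_type", "description", "creators", "license"]

def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

deposit = {
    "title": "Example sensor measurements, 2019",
    "upload_type": "dataset",
    "description": "Hourly temperature readings from a test rig.",
    "creators": [{"name": "Doe, Jane", "affiliation": "TU Delft"}],
    "license": "cc-by-4.0",
    "keywords": ["temperature", "sensors"],
}
```

A repository typically enforces such checks in its submission form, so in practice the depositor only needs to fill in the fields the archive asks for.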
I am grateful to the University of Notre Dame for covering my travel costs to the MPS FAIR Hackathon. Special thanks to Natalie Meyers from the University of Notre Dame, and Marta Teperek, Yasemin Turkyilmaz-van der Velden and the TU Delft Data Stewards for making it possible for me to attend.
Maria Cruz is Community Manager Research Data Management at the VU Amsterdam.
We are happy to announce two new metadata elements added to the 4TU.ResearchData archive:
- Funder information
- Subject
To link datasets in a more structured way to funding, we have made funding information available in dedicated metadata fields. Depositors are asked to submit the name(s) of the funder and grant number as part of the standard metadata deposit when they submit their dataset.
The funding information is displayed on the public dataset landing page, which includes the funder identifier from the Funder Registry.
This has several benefits:
- Funding organizations are able to better track the published results of their grants
- Research institutions are able to monitor the published output of their employees
- Greater transparency on who funded the research
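Structured funder metadata of this kind resembles the funding reference used by the DataCite metadata schema, which combines a funder name, a Crossref Funder Registry identifier and an award number. The sketch below is illustrative only; the funder and grant number are invented placeholders, and 4TU.ResearchData's internal representation may differ:

```python
# Illustrative only: a funding reference in the style of the DataCite
# metadata schema, which uses Crossref Funder Registry identifiers.
# The funder identifier and grant number below are invented placeholders.

funding_reference = {
    "funderName": "Example Science Foundation",
    "funderIdentifier": "https://doi.org/10.13039/000000000000",
    "funderIdentifierType": "Crossref Funder ID",
    "awardNumber": "123.456.789",
}

def format_funding(ref: dict) -> str:
    """Render the funder line as it might appear on a dataset landing page."""
    return f"{ref['funderName']} (grant {ref['awardNumber']})"
```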
In addition to ‘Keyword’, which describes the topic of a dataset, we have recently added a new metadata element, ‘Subject’, to expose datasets according to their field of research.
When submitting their dataset, depositors are required to choose one or more subject categories (or fields of research) from a list, which originates from the Australian and New Zealand Standard Research Classification (ANZSRC).
Each main subject category consists of sub-categories to make the field of research more specific; for example, when selecting the main subject category ‘Biological Sciences’, depositors are offered a set of sub-categories from which they can choose.
The Subject metadata element has also been added as a search facet, allowing users to refine their search results by subject category.
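In spirit, a subject facet is derived by simple counting: each category is mapped to the number of datasets tagged with it. A minimal sketch, with invented dataset records and category names:

```python
# A minimal sketch of how a subject search facet can be derived: count how
# many datasets carry each subject category. The dataset records and
# category names here are invented for illustration.
from collections import Counter

datasets = [
    {"title": "Reef survey", "subjects": ["Biological Sciences", "Ecology"]},
    {"title": "Alloy tests", "subjects": ["Materials Engineering"]},
    {"title": "Gene atlas",  "subjects": ["Biological Sciences", "Genetics"]},
]

def subject_facets(records):
    """Map each subject category to the number of datasets tagged with it."""
    return Counter(s for record in records for s in record["subjects"])

facets = subject_facets(datasets)
```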
Another notable feature is that every subject category page shows all datasets that belong to that category (via the relation type ‘is subject of’), as well as all related subject categories that share datasets with the current one.
See for example: https://data.4tu.nl/repository/cat:0806
Datasets deposited before this new metadata element was in place have been updated with one or more subject categories by our moderators. Although we have been very careful, we may have made mistakes. Should this be the case, please don’t hesitate to contact us at email@example.com.
On Thursday 30 August and Friday 31 August, TU Delft Library hosted two events dedicated to the new European General Data Protection Regulation (GDPR) and its implications for research data. Both events were organised by Research Data Netherlands: a collaboration between the 4TU.Centre for Research Data, DANS and SURF (represented by the National Research Data Management Coordination Point).
First: do no harm. Protecting personal data is not against data sharing
On the first day, we heard case studies from experts in the field, as well as from various institutional support service providers. Veerle Van den Eynden from the UK Data Service kicked off the day with her presentation, which clearly stated that the need to protect personal data does not stand in the way of data sharing. She outlined the framework provided by the GDPR which makes sharing possible, and explained that when it comes to data sharing one should always adhere to the principle “do no harm”. However, she reflected that too often both researchers and research support services (such as ethics committees) prefer to avoid any possible risks rather than carefully consider and manage them appropriately. She concluded with a compelling case study from the UK Data Service, where researchers were able to successfully share data from research on vulnerable individuals (asylum seekers and refugees).
From a one-stop shop solution to privacy champions
We subsequently heard case studies from four Dutch research institutions (Tilburg University, TU Delft, VU Amsterdam and Erasmus University Rotterdam) about their practical approaches to supporting researchers working with personal research data. Jan Jans from Tilburg explained their “one stop shop” form, which, when completed by researchers, sorts out all the requirements related to GDPR, ethics and research data management. Marthe Uitterhoeve from TU Delft said that Delft was developing a similar approach, but based on data management plans. Marlon Domingus from Erasmus University Rotterdam explained their process based on defining different categories of research and determining the types of data processing associated with them, rather than trying to list every single research project at the institution. Finally, Jolien Scholten from VU Amsterdam presented their idea of appointing privacy champions who receive dedicated training on data protection and who act as the first contact points for questions related to GDPR within their communities.
There were lots of inspiring ideas, and a consensus in the room that it would be worth re-convening in a year’s time to evaluate the different approaches and share lessons learned.
How to share research data in practice?
Next, we discussed three different models for helping researchers share their research data. Emilie Kraaikamp from DANS presented their strategy of providing two different access levels: open access data and restricted access data. Open datasets consist mostly of research data which are fully anonymised. Restricted access data must be requested (via an email to the depositor), and the depositor decides whether or not to grant access.
Veerle Van Den Eynden from the UK Data Service discussed their approach based on three different access levels: open data, safeguarded data (equivalent to “restricted access data” in DANS) and controlled data. Controlled datasets are very sensitive and researchers who wish to get access to such datasets need to undergo a strict vetting procedure. They need to complete training, their application needs to be supported by a research institution, and typically researchers access such datasets in safe locations, on safe servers and are not allowed to copy the data. Veerle explained that only a relatively small number of sensitive datasets (usually from governmental agencies) are shared under controlled access conditions.
The last case study was from Zosia Beckles from the University of Bristol, who explained that at Bristol, a dedicated Data Access Committee has been created to handle requests for controlled access datasets. Researchers responsible for the datasets are asked for advice on how to respond to requests, but it is the Data Access Committee that ultimately decides whether access should be granted and, if necessary, can overrule the researcher’s advice. The procedure relieves researchers of the burden of dealing with data access requests.
DataTags – decisions about sharing made easy(ier)
Ilona von Stein from DANS continued the discussion about data sharing and the means by which sharing could be facilitated. She described an online tool developed by DANS (based on a concept initially developed by colleagues from Harvard University, but adapted to European GDPR needs) that asks researchers simple questions about their datasets and returns a tag, which indicates whether the data are suitable for sharing and what the most suitable sharing options are. The prototype of the tool is now available for testing, and DANS plans to develop it further to see if it could also be used to assist researchers working with data across the whole research lifecycle (not only at the final, data sharing stage).
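To give a flavour of the DataTags idea, the sketch below maps a few yes/no answers about a dataset to a suggested access tag. The real DANS questionnaire and tag vocabulary are far more elaborate; the rules here are invented for illustration:

```python
# Hypothetical sketch of the DataTags concept: a few yes/no answers about a
# dataset map to a suggested access tag. The real DANS questionnaire and
# tag vocabulary are more elaborate; these rules are invented.

def data_tag(contains_personal_data: bool,
             fully_anonymised: bool,
             consent_for_sharing: bool) -> str:
    """Return a suggested access level for a dataset."""
    if not contains_personal_data or fully_anonymised:
        return "open"        # no (remaining) personal data: open access
    if consent_for_sharing:
        return "restricted"  # personal data, but sharing was consented to
    return "controlled"      # personal data without consent: strict vetting
```

The value of such a tool is less in the rules themselves than in forcing the sharing decision to be made explicitly and consistently, rather than case by case.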
What are the most impactful & effortless tactics to provide controlled access to research data?
The final interactive part of the workshop was led by Alastair Dunning, the Head of 4TU.Centre for Research Data. Alastair used Mentimeter to ask attendees to judge the impact and effort of fourteen different tactics and solutions which research institutions can use to provide controlled access to research data. More than forty people engaged with the online survey, allowing Alastair to shortlist the five tactics deemed most impactful and effort-efficient:
- Create a list of trusted archives where researchers can deposit personal data
- Publish an informed consent template for your researchers
- Publish on university website a list of FAQs concerning personal data
- Provide access to a trusted Data Anonymisation Service
- Create categories to define different types of personal data at your institution
Alastair concluded that these should probably be the priorities for research institutions which don’t yet have the above in place.
How to put all the learning into practice?
The second event was dedicated to putting all the learning and concepts developed during the first day into practice. Researchers working with personal data, as well as those directly supporting researchers, brought their laptops and followed practical exercises led by Veerle Van den Eynden and Cristina Magder from the UK Data Service. We started by looking at a GDPR-compliant consent form template. Subsequently, we practised data encryption using VeraCrypt. We then moved on to data anonymisation strategies. First, Veerle explained possible tactics (with nicely illustrated examples) for de-identification and pseudonymisation of qualitative data. This was followed by comprehensive hands-on training delivered by Cristina Magder on disclosure review and de-identification of quantitative data using sdcMicro.
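As a generic illustration of the pseudonymisation idea (not the sdcMicro workflow, which targets quantitative disclosure risk), direct identifiers can be replaced by keyed hashes, so the same person always receives the same pseudonym while the mapping cannot be reversed without the key. The key and record below are invented:

```python
# A generic illustration of pseudonymisation (not the sdcMicro workflow):
# direct identifiers are replaced by keyed hashes, so the same person always
# gets the same pseudonym, while the mapping cannot be reversed without the
# key. The key must be stored separately from the shared dataset.
import hashlib
import hmac

SECRET_KEY = b"keep-this-key-out-of-the-shared-dataset"  # invented example key

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable, irreversible pseudonym."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:10]

record = {"name": "Jane Doe", "answer": "agree"}
record["name"] = pseudonymise(record["name"])
```

Using a keyed hash (HMAC) rather than a plain hash matters: without the key, an attacker could re-identify people simply by hashing candidate names.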
Altogether, the practical exercises made clear how to work effectively with personal research data from the very start of a project (consent, encryption) all the way to de-identification that enables sharing and re-use, whilst protecting personal data at all stages.
Conclusion: GDPR as an opportunity
I think that the key conclusion of both days was that the GDPR, while challenging to implement, provides an excellent opportunity both to researchers and to research institutions to review and improve their research practices. The key to this is collaboration: across the various stakeholders within the institution (to make workflows more coherent and improve collaboration), but also between different institutions. An important aspect of these two events was that representatives from multiple institutions (and countries!) were present to talk about their individual approaches and considerations. Practice exchange and lessons learned can be invaluable to allow institutions to avoid similar mistakes and to decide which approaches might work best in particular settings.
We will definitely consider organising a similar meeting in a year’s time to see where everyone is and which workflows and solutions tend to work best.
Presentations from both events are available on Zenodo:
Written by Marta Teperek and Alastair Dunning
There are many drivers pushing for long-term preservation of research data and for making them Findable, Accessible, Interoperable and Re-usable (FAIR). There is a consensus that sharing and preserving data makes research more efficient (no need to generate the same data all over again), more innovative (data re-use across disciplines) and more reproducible (data supporting research findings are available for scrutiny and validation). Consequently, most funding bodies require that research data are stored, preserved and made available for at least 10 years.
For example, the European Commission requires that projects “develop a Data Management Plan (DMP), in which they will specify what data will be open: detailing what data the project will generate, whether and how it will be exploited or made accessible for verification and re-use, and how it will be curated and preserved.”
But who should pay for that long-term data storage and preservation?
Given that most funding bodies now require that research data is preserved and made available long-term, it is perhaps natural to think that funding bodies would financially support researchers in meeting these new requirements. Coming back to the previous example, the funding guide for the European Commission’s Horizon 2020 funding programme says that “costs associated with open access to research data, including the creation of the data management plan, can be claimed as eligible costs of any Horizon 2020 grant.”
So one would think that the problem is solved and that funding for making data available long-term can be obtained. But then… why would we be writing this blog post? As is usually the case, the devil is in the detail. The European Commission’s financial rules require that grant money be spent only during the lifetime of the project.
Naturally, long-term preservation of research data occurs only after datasets have been created and curated, and in most cases it starts only when the project finishes. In other words, the costs of long-term data preservation are not eligible costs on grants funded by the European Commission*.
Importantly, the European Commission’s funding is just an example. Most funding bodies do not consider the costs of long-term data curation as eligible costs on grants. In fact, the author is not aware of any funding body which would consider these costs eligible**.
So what’s the solution?
Funding bodies suggest that long-term data preservation should be offered to researchers as one of the standard institutional support services, with the costs recovered through the overhead/indirect funding allocation on grant applications. Grants from the European Commission have a flat 25% overhead allocation, which is already generous compared with some other funding bodies that do not allow any overhead allocation at all. The problem is that at large, research-intensive institutions, overhead costs run at around 50% of the original grant value.
This means that for every 1 million euros which researchers receive to spend on their research projects, research institutions need to find an extra 0.5 million euros from elsewhere to support those projects (facilities costs, administration support, IT support, etc.). Given that institutions are already not recovering their full economic costs from research grants, it is difficult to imagine how the new requirements for long-term data preservation can be absorbed within the existing overhead/indirect costs stream.
The problems described above are not new. In fact, they have already been discussed with funding bodies on several occasions (see here and here for some examples). But not much has changed so far. No new streams of money have been made available, neither through direct grant funding nor through increased overhead caps for institutions providing long-term preservation services for research data.
Meanwhile, researchers (those creating large datasets in particular) continue to struggle to find financial support for long-term preservation and curation of their research data, as nicely illustrated in a blog post by our colleagues at Cambridge.
Since the discussions with funding bodies held by individual institutions do not seem to have been fruitful, perhaps the time has come for some joined-up national (or international) efforts. Could this be an interesting new project for the Dutch National Coordination Point Research Data Management (LCRDM) to tackle?
* – Some suggest that the costs are eligible if the invoices for long-term data preservation are paid during the lifetime of the project. However, this is only true if the invoice itself does not specify that the costs are for long-term preservation (i.e. it simply states ‘storage charges’, without indicating the long-term aspect). This only confirms that funders are not willing to pay for long-term preservation, and it forces some to use more creative tactics and measures to finance it.
** – Two funding bodies in the UK, NERC (Natural Environment Research Council) and ESRC (Economic and Social Research Council), pay for the costs of long-term data preservation by financing their own data archives (NERC Data Centres and the UK Data Service, respectively) where the grantees are required to deposit any data resulting from the awarded funding.
A PDF (and citable) version of this document is available via Zenodo. DOI: https://doi.org/10.5281/zenodo.1316938
We talked with Dr. Riccardo Riva, an assistant professor at the TU Delft Faculty of Civil Engineering and Geosciences who has published several datasets via the 4TU.Centre for Research Data. We spoke about his recent paper in the open access journal The Cryosphere on the surprising effects of melting glaciers and ice sheets on the solid Earth through the last century, and how this affects reconstructions of past sea level from sparse observations.
The data underlying Riva’s paper were made publicly available through the 4TU.Centre for Research Data. Riva believes that sharing data “helps progress in science” and that “if you get public money to do research, then the results should be public”.
“When data are open, then anybody can use it. There will be some competition, but that’s only good. Competition leads to new ideas, which in turn lead to even more ideas and to progress in science.”
The 4TU.Centre for Research Data, hosted by the TU Delft Library, offers researchers a reliable long-term archive for technical and scientific research data. It creates opportunities for linking publications to underlying data, thereby promoting improved findability and citability for research data. Over 90% of the data stored in the archive are environmental research data coded in netCDF – a data format and data model that, although generic, is mostly used in the climate, ocean and atmospheric sciences. Therefore, 4TU.ResearchData has a special interest in this area and offers specific services and tools to enhance access to and use of netCDF datasets. TU Delft Library also offers Research Data Management Support during all stages of the research lifecycle.
This presentation is available via Zenodo: 10.5281/zenodo.1252925.