On Thursday 30 August and Friday 31 August, TU Delft Library hosted two events dedicated to the new European General Data Protection Regulation (GDPR) and its implications for research data. Both events were organised by Research Data Netherlands, a collaboration between the 4TU.Centre for Research Data, DANS and SURF (represented by the National Research Data Management Coordination Point).
First, do no harm: protecting personal data is not at odds with data sharing
On the first day, we heard case studies from experts in the field, as well as from various institutional support service providers. Veerle Van den Eynden from the UK Data Service kicked off the day with a presentation which made clear that the need to protect personal data is not at odds with data sharing. She outlined the framework provided by the GDPR which makes sharing possible, and explained that when it comes to data sharing one should always adhere to the principle of “do no harm”. However, she reflected that too often both researchers and research support services (such as ethics committees) prefer to avoid any possible risks rather than carefully consider and manage them appropriately. She concluded with a compelling case study from the UK Data Service, in which researchers were able to successfully share data from research on vulnerable individuals (asylum seekers and refugees).
From a one-stop shop solution to privacy champions
We subsequently heard case studies from four Dutch research institutions (Tilburg University, TU Delft, VU Amsterdam and Erasmus University Rotterdam) about their practical approaches to supporting researchers working with personal research data. Jan Jans from Tilburg explained their “one stop shop” form which, when completed by researchers, covers all the requirements related to the GDPR, ethics and research data management. Marthe Uitterhoeve from TU Delft said that Delft was developing a similar approach, but based on data management plans. Marlon Domingus from Erasmus University Rotterdam explained their process, which is based on defining different categories of research and determining the types of data processing associated with them, rather than trying to list every single research project at the institution. Finally, Jolien Scholten from VU Amsterdam presented their idea of appointing privacy champions, who receive dedicated training on data protection and act as the first contact points for GDPR-related questions within their communities.
The day generated lots of inspiring ideas, and there was consensus in the room that it would be worth reconvening in a year’s time to evaluate the different approaches and share lessons learned.
How to share research data in practice?
Next, we discussed three different models for helping researchers share their research data. Emilie Kraaikamp from DANS presented their strategy of providing two different access levels: open access data and restricted access data. Open datasets consist mostly of research data which are fully anonymised. Restricted access data must be requested via an email to the depositor, who decides whether or not access can be granted.
Veerle Van den Eynden from the UK Data Service discussed their approach, based on three different access levels: open data, safeguarded data (equivalent to DANS’s “restricted access data”) and controlled data. Controlled datasets are very sensitive, and researchers who wish to access them must undergo a strict vetting procedure: they need to complete training, their application needs to be supported by a research institution, and they typically access the data in safe locations, on safe servers, and are not allowed to copy it. Veerle explained that only a relatively small number of sensitive datasets (usually from governmental agencies) are shared under controlled access conditions.
The last case study came from Zosia Beckles from the University of Bristol, who explained that at Bristol a dedicated Data Access Committee has been created to handle requests for controlled access datasets. Researchers responsible for the datasets are asked for advice on how to respond to requests, but it is the Data Access Committee that ultimately decides whether access should be granted and, if necessary, can overrule the researcher’s advice. The procedure relieves researchers of the burden of dealing with data access requests.
DataTags – decisions about sharing made easy(ier)
Ilona von Stein from DANS continued the discussion about data sharing and the means by which sharing could be facilitated. She described an online tool developed by DANS (based on a concept initially developed by colleagues at Harvard University, but adapted to European GDPR needs) that allows researchers to answer simple questions about their datasets and returns a tag indicating whether the data are suitable for sharing and which sharing options are most appropriate. A prototype of the tool is now available for testing, and DANS plans to develop it further to see whether it could also assist researchers working with data across the whole research lifecycle (not only at the final, data sharing stage).
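As a rough illustration of how such a question-based tagging tool works, here is a minimal Python sketch. The questions, tag names and decision logic are hypothetical simplifications and do not represent the actual DANS or Harvard implementation.

```python
# Hypothetical sketch of a DataTags-style decision flow (illustration only).
# A few yes/no answers about a dataset are mapped to a sharing tag.

def data_tag(contains_personal_data: bool,
             is_fully_anonymised: bool,
             consent_allows_sharing: bool) -> str:
    """Return an illustrative sharing tag for a dataset."""
    if not contains_personal_data or is_fully_anonymised:
        return "OPEN"          # no (remaining) personal data: suitable for open access
    if consent_allows_sharing:
        return "RESTRICTED"    # personal data, but sharing possible under controlled access
    return "DO_NOT_SHARE"      # personal data without a basis for sharing

print(data_tag(contains_personal_data=True,
               is_fully_anonymised=False,
               consent_allows_sharing=True))   # -> RESTRICTED
```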
What are the most impactful & effortless tactics to provide controlled access to research data?
The final, interactive part of the workshop was led by Alastair Dunning, Head of the 4TU.Centre for Research Data. Alastair used Mentimeter to ask attendees to judge the impact and effort of fourteen different tactics and solutions which research institutions can use to provide controlled access to research data. More than forty people engaged with the online survey, which allowed Alastair to shortlist the five tactics deemed the most impactful and effort-efficient:
- Create a list of trusted archives where researchers can deposit personal data
- Publish an informed consent template for your researchers
- Publish a list of FAQs concerning personal data on the university website
- Provide access to a trusted Data Anonymisation Service
- Create categories to define different types of personal data at your institution
Alastair concluded that these should probably be the priorities for research institutions which don’t yet have such measures in place.
How to put all the learning into practice?
The second event was dedicated to putting the learning and concepts developed during the first day into practice. Researchers working with personal data, as well as those directly supporting researchers, brought their laptops and followed practical exercises led by Veerle Van den Eynden and Cristina Magder from the UK Data Service. We started by looking at a GDPR-compliant consent form template. Subsequently, we practised data encryption using VeraCrypt. We then moved on to data anonymisation strategies. First, Veerle explained possible tactics (with nicely illustrated examples) for de-identification and pseudonymisation of qualitative data. This was followed by comprehensive hands-on training delivered by Cristina Magder on disclosure review and de-identification of quantitative data using sdcMicro.
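To give a flavour of what disclosure review involves, here is a simplified Python sketch of the underlying idea (this is not sdcMicro, which is an R package): flag records whose combination of quasi-identifiers is unique or very rare, since such records could allow re-identification. All data and column names below are invented for the example.

```python
# Simplified illustration of disclosure review (not sdcMicro itself):
# flag records whose combination of quasi-identifiers is rare, since such
# records could allow re-identification of individuals.
import pandas as pd

# Hypothetical survey data; column names are invented for the example.
df = pd.DataFrame({
    "age_group": ["20-29", "20-29", "30-39", "30-39", "70-79"],
    "gender":    ["f", "m", "f", "f", "m"],
    "postcode":  ["2628", "2628", "2611", "2611", "2611"],
    "answer":    [3, 4, 2, 5, 1],
})

quasi_identifiers = ["age_group", "gender", "postcode"]

# k = number of records sharing the same combination of quasi-identifiers
df["k"] = df.groupby(quasi_identifiers)["answer"].transform("size")

# Records with a small k are a disclosure risk and need further treatment
# (recoding, suppression, ...) before the data can be shared.
print(df[df["k"] < 2])
```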
Altogether, the practical exercises gave participants a clear understanding of how to work effectively with personal research data from the very start of a project (consent, encryption) all the way to data de-identification that enables sharing and re-use, while protecting personal data at all stages.
Conclusion: GDPR as an opportunity
I think the key conclusion of both days was that the GDPR, while challenging to implement, provides an excellent opportunity for both researchers and research institutions to review and improve their research practices. The key to this is collaboration: across the various stakeholders within an institution (to make workflows more coherent), but also between different institutions. An important aspect of these two events was that representatives from multiple institutions (and countries!) were present to talk about their individual approaches and considerations. Exchanging practices and lessons learned can be invaluable in helping institutions avoid repeating each other’s mistakes and in deciding which approaches might work best in particular settings.
We will definitely consider organising a similar meeting in a year’s time to see where everyone is and which workflows and solutions tend to work best.
Presentations from both events are available on Zenodo:
Written by Marta Teperek and Alastair Dunning
There are many drivers pushing for long-term preservation of research data and for making them Findable, Accessible, Interoperable and Re-usable (FAIR). There is a consensus that sharing and preserving data makes research more efficient (no need to generate the same data all over again), more innovative (data re-use across disciplines) and more reproducible (data supporting research findings are available for scrutiny and validation). Consequently, most funding bodies require that research data are stored, preserved and made available for at least 10 years.
For example, the European Commission requires that projects “develop a Data Management Plan (DMP), in which they will specify what data will be open: detailing what data the project will generate, whether and how it will be exploited or made accessible for verification and re-use, and how it will be curated and preserved.”
But who should pay for that long-term data storage and preservation?
Given that most funding bodies now require that research data is preserved and made available long-term, it is perhaps natural to think that funding bodies would financially support researchers in meeting these new requirements. Coming back to the previous example, the funding guide for the European Commission’s Horizon 2020 funding programme says that “costs associated with open access to research data, including the creation of the data management plan, can be claimed as eligible costs of any Horizon 2020 grant.”
So one would think that the problem is solved and that funding for making data available long-term can be obtained. But then… why would we be writing this blog post? As is usually the case, the devil is in the detail. The European Commission’s financial rules require that grant money be spent only during the lifetime of the project.
Naturally, long-term preservation of research data occurs only after datasets have been created and curated, and in most cases it starts only when the project finishes. In other words, the costs of long-term data preservation are not eligible costs on grants funded by the European Commission*.
Importantly, the European Commission’s funding is just an example. Most funding bodies do not consider the costs of long-term data curation as eligible costs on grants. In fact, the author is not aware of any funding body which would consider these costs eligible**.
So what’s the solution?
Funding bodies suggest that long-term data preservation should be offered to researchers as one of the standard institutional support services, the costs of which should be recovered from the overhead/indirect cost allocation on grant applications. Grants from the European Commission have a flat 25% overhead rate, which is already generous compared with some other funding bodies that do not allow any overhead allocation at all. The problem is that at larger, research-intensive institutions, the actual overhead costs are around 50% of the original grant value.
This means that for every 1 mln Euro which researchers receive to spend on their research projects, research institutions need to find an extra 0.5 mln Euro from elsewhere to support these projects (facilities costs, administration support, IT support, etc.). Given that institutions are already not recovering their full economic costs from research grants, it is difficult to imagine how the new requirements for long-term data preservation can be absorbed within the existing overhead/indirect cost stream.
The problems described above are not new. In fact, they have already been discussed with funding bodies on several occasions (see here and here for some examples). But not much has changed so far: no new streams of money have been made available, neither through direct grant funding nor through increased overhead caps for institutions providing long-term preservation services for research data.
Meanwhile, researchers (those creating large datasets in particular) continue to struggle to find financial support for long-term preservation and curation of their research data, as nicely illustrated in a blog post by our colleagues at Cambridge.
Since the discussions with funding bodies held by individual institutions do not seem to have been fruitful, perhaps the time has come for some joined-up national (or international) effort. Could this be an interesting new project for the Dutch National Coordination Point Research Data Management (LCRDM) to tackle?
* – Some suggest that the costs are eligible if the invoices for long-term data preservation are paid during the lifetime of the project. However, this is only true if the invoice itself does not specify that the costs are for long-term preservation (i.e. if it simply states ‘storage charges’, without indicating the long-term aspect). This only confirms that funders are not willing to pay for long-term preservation and forces some to use more creative tactics and measures to finance it.
** – Two funding bodies in the UK, NERC (Natural Environment Research Council) and ESRC (Economic and Social Research Council), pay for the costs of long-term data preservation by financing their own data archives (the NERC Data Centres and the UK Data Service, respectively), where grantees are required to deposit any data resulting from the awarded funding.
A PDF (and citable) version of this document is available via Zenodo. DOI: https://doi.org/10.5281/zenodo.1316938
We talked with Dr. Riccardo Riva, an assistant professor at the TU Delft Faculty of Civil Engineering and Geosciences, who has published several datasets via the 4TU.Centre for Research Data. We spoke about his recent paper in the open access journal The Cryosphere on the surprising effects of melting glaciers and ice sheets on the solid Earth through the last century, and how this affects reconstructions of past sea level from sparse observations.
The data underlying Riva’s paper were made publicly available through the 4TU.Centre for Research Data. Riva believes that sharing data “helps progress in science” and that “if you get public money to do research, then the results should be public”.
“When data are open, then anybody can use it. There will be some competition, but that’s only good. Competition leads to new ideas, which in turn lead to even more ideas and to progress in science.”
The 4TU.Centre for Research Data, hosted by the TU Delft Library, offers researchers a reliable long-term archive for technical and scientific research data. It creates opportunities for linking publications to underlying data, thereby improving the findability and citability of research data. Over 90% of the data stored in the archive are environmental research data coded in netCDF – a data format and data model that, although generic, is mostly used in climate, ocean and atmospheric sciences. The archive therefore has a special interest in this area and offers specific services and tools to enhance access to and use of netCDF datasets. TU Delft Library also offers Research Data Management Support during all stages of the research lifecycle.
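For readers unfamiliar with netCDF: it is a self-describing format, so the metadata travel with the data. The minimal Python sketch below, using the netCDF4 library, shows how such a file can be inspected; the file name is a made-up example.

```python
# Minimal sketch of inspecting a netCDF file in Python (the file name is invented).
from netCDF4 import Dataset

with Dataset("example_sea_level.nc") as ds:
    print(ds.ncattrs())                      # global metadata (title, source, history, ...)
    for name, var in ds.variables.items():   # variables with their dimensions and units
        units = var.units if "units" in var.ncattrs() else ""
        print(name, var.dimensions, units)
```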
This presentation is available via Zenodo: DOI 10.5281/zenodo.1252925.
Today we are presenting at the PV2018 conference in Harwell, UK.
This presentation can be downloaded from Zenodo.
The paper for the conference proceedings is available on OSF Preprints.
Title: Adding Value and Facilitating Data Reuse: the Case of the 4TU.Centre for Research Data
Authors: Maria J. Cruz, Egbert Gramsbergen
Abstract: The history of the 4TU.Centre for Research Data goes back to 2008, when it started as a project of the libraries of three technical universities in the Netherlands. The aim was to serve the data curation needs of heterogeneous research communities. Fast forward ten years, and over 90% of the data stored in the 4TU archive are geoscientific datasets coded in netCDF (Network Common Data Form). This is a data format and model that, although generic, is mostly and widely used in atmospheric sciences and oceanography. As an endeavour to ensure that the 4TU.Centre for Research Data remains relevant and successful in the long term, we are exploring options for expanding the services related to netCDF data and potentially build a community of netCDF data depositors and users. Here we present the results of semi-structured, qualitative interviews with eleven researchers, all based in the Netherlands, who use and produce netCDF data; nine of them deposited netCDF data in the 4TU archive. These researchers represent heterogeneous research communities within the Earth sciences, with different views and attitudes to data archiving and publishing. Any new services or community building attempts will need to take this diversity into account. A common need for training and advice may guide the way forward for the 4TU.Centre for Research Data.
These are the most important updates:
- Django has been updated from version 1.11.10 to version 1.11.12
- A new status, “published”, has been introduced for submitted datasets. This status will be applied to datasets that are published and findable. The status “accepted”, which was formerly used for published datasets, will now be applied to datasets that are ‘waiting’ to be published (e.g. when an embargo date is set or Egbert is working on the data files). See the illustrative sketch after this list.
- Smaller updates to the forms regarding accepting and publishing datasets
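For illustration only, the dataset statuses described above can be thought of as a small state machine. The Python sketch below uses names and transitions assumed from this description; it is not the archive’s actual code.

```python
# Illustrative sketch of the dataset statuses described in the release notes;
# names and allowed transitions are assumptions, not the actual implementation.
from enum import Enum

class DatasetStatus(Enum):
    SUBMITTED = "submitted"
    ACCEPTED = "accepted"    # waiting to be published (e.g. embargo set, files being prepared)
    PUBLISHED = "published"  # published and findable

ALLOWED_TRANSITIONS = {
    DatasetStatus.SUBMITTED: {DatasetStatus.ACCEPTED},
    DatasetStatus.ACCEPTED:  {DatasetStatus.PUBLISHED},
    DatasetStatus.PUBLISHED: set(),
}

def can_move(current: DatasetStatus, new: DatasetStatus) -> bool:
    """Check whether a status change is allowed in this illustrative model."""
    return new in ALLOWED_TRANSITIONS[current]

print(can_move(DatasetStatus.ACCEPTED, DatasetStatus.PUBLISHED))  # -> True
```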