Written by Marta Teperek and Alastair Dunning
There are many drivers pushing for the long-term preservation of research data and for making them Findable, Accessible, Interoperable and Re-usable (FAIR). There is a consensus that sharing and preserving data makes research more efficient (no need to generate the same data all over again), more innovative (data re-use across disciplines) and more reproducible (data supporting research findings are available for scrutiny and validation). Consequently, most funding bodies require that research data are stored, preserved and made available for at least 10 years.
For example, the European Commission requires that projects “develop a Data Management Plan (DMP), in which they will specify what data will be open: detailing what data the project will generate, whether and how it will be exploited or made accessible for verification and re-use, and how it will be curated and preserved.”
But who should pay for that long-term data storage and preservation?
Given that most funding bodies now require that research data is preserved and made available long-term, it is perhaps natural to think that funding bodies would financially support researchers in meeting these new requirements. Coming back to the previous example, the funding guide for the European Commission’s Horizon 2020 funding programme says that “costs associated with open access to research data, including the creation of the data management plan, can be claimed as eligible costs of any Horizon 2020 grant.”
So one might think the problem is solved and that funding for making data available long-term can be obtained. But then… why would we be writing this blog post? As is usually the case, the devil is in the detail. The European Commission’s financial rules stipulate that grant money can only be spent during the lifetime of the project.
Naturally, long-term preservation of research data occurs only after datasets have been created and curated, and in most cases begins only once the project has finished. In other words, the costs of long-term data preservation are not eligible costs on grants funded by the European Commission*.
Importantly, the European Commission’s funding is just an example. Most funding bodies do not consider the costs of long-term data curation as eligible costs on grants. In fact, the authors are not aware of any funding body which considers these costs eligible**.
So what’s the solution?
Funding bodies suggest that long-term data preservation should be offered to researchers as one of the standard institutional support services, with the costs recovered through the overhead/indirect funding allocation on grant applications. Grants from the European Commission carry a flat 25% overhead allocation, which is already generous compared with some other funding bodies that do not allow any overhead cost allocation at all. The problem is that at larger, research-intensive institutions, overhead costs run at around 50% of the original grant value.
This means that for every €1 million which researchers receive to spend on their research projects, research institutions need to find an extra €0.5 million from elsewhere to support these projects (facilities costs, administrative support, IT support, etc.). Given that institutions are already not recovering their full economic costs from research grants, it is difficult to imagine how the new requirements for long-term data preservation can be absorbed within the existing overhead/indirect costs stream.
The problems described above are not new. In fact, they have been discussed with funding bodies on several occasions (see here and here for some examples). But not much has changed so far. No new streams of money have been made available: neither through direct grant funding, nor through increased overhead caps for institutions providing long-term preservation services for research data.
Meanwhile, researchers (particularly those creating large datasets) continue to struggle to find financial support for the long-term preservation and curation of their research data, as nicely illustrated in a blog post by our colleagues at Cambridge.
Since discussions with funding bodies held by individual institutions do not seem to have been fruitful, perhaps the time has come for some joined-up national (or international) efforts. Could this be an interesting new project for the Dutch National Coordination Point Research Data Management (LCRDM) to tackle?
* – Some suggest that the costs are eligible if the invoices for long-term data preservation are paid during the lifetime of the project. However, this is only true if the invoice itself does not specify that the costs are for long-term preservation (i.e. the invoice is simply for ‘storage charges’, without indicating the long-term aspect). This only confirms that funders are not willing to pay for long-term preservation, and forces some to use more creative tactics and measures to finance it.
** – Two funding bodies in the UK, NERC (Natural Environment Research Council) and ESRC (Economic and Social Research Council), pay for the costs of long-term data preservation by financing their own data archives (NERC Data Centres and the UK Data Service, respectively) where the grantees are required to deposit any data resulting from the awarded funding.
Colleagues in university libraries in the United States recently published an ambitious and well-thought-out shared service model for research data curation.
It is based on a very pertinent observation: looking after the huge panoply of file formats and software types that constitute research data is beyond the remit of any single institution. The expertise needed sits in many different places.
The model also highlights how we need to focus on data curation at the moment researchers wish to publish a dataset. Curation is not something to be done tomorrow.
At our own data archive, 4TU.ResearchData, our day-to-day work is not so much on data curation of datasets that have already been deposited with us.
Rather, the focus is on the moment researchers (some from Dutch universities, some from further afield) send us their initial dataset. Often, this comes with limited metadata and next to no documentation.
It’s the role of our data moderators to liaise with the researchers, improving the quality of the metadata and adding enough documentation to explain the codes, acronyms and software that accompany the datasets. This work is crucial – it ensures that the data is re-usable by other researchers.
Of course, we still need to think about long-term data curation to make sure the data remains readable in a technical sense. But our greater challenge is making sure the data is contextualised.
So it’s good that such work is at the heart of the Data Curation Network. The workflow suggested fully acknowledges the need to work with moderators who can fully appraise the data and provide the necessary curation expertise at the moment of deposit.