Author: Esther Plomp
The first Remote ReproHack was held on the 14th of May 2020. About 30 participants joined the online party with the mission to learn more about reproducibility and to reproduce some papers! A ReproHack is a one-day event where participants aim to reproduce papers of their choice from a list of proposed papers whose authors have indicated that they would like to receive feedback. The ReproHack aims to offer a safe space for constructive feedback, so that the event is a valuable learning experience for both the participants and the authors.
Recent studies and surveys have indicated that scientific papers often cannot be reproduced because supporting data and code are inaccessible or incorrect (see for example the Nature survey results here). In computational research, only 26% of papers are reproducible (Stodden 2018). To learn more about how these numbers can be improved, I joined the first ReproHack in the Netherlands last year. During that ReproHack I managed to reproduce the figures from a physics paper on Majorana bound states by André Melo and colleagues. I must admit that most of the work was done by Sander, who was very patient with my beginner Python skills. This year, I was set on reproducing a paper that made use of R, a language I have come to appreciate since attending the Repro2020 course earlier this year.
The Remote ReproHack started by welcoming the participants through an online text document (HackMD) where we could sign in and list our names, affiliations, and Twitter/GitHub information. This way we could learn more about the other participants. The check-in document also provided the schedule of the day, the list of research papers from which we could choose a paper to reproduce, and the excellent code of conduct. After this digital check-in and words of welcome, Daniel Nüst gave a talk about his work on improving the reproducibility of software and code. Next, Anna Krystalli, one of the organisers, took us through the process of reproducing and reviewing the papers during the ReproHacking breakout sessions. During these breakout sessions the participants were split into smaller groups to work on the papers that they had selected to reproduce. It was also possible to try to reproduce a paper by yourself.
10:00 – Welcome and Intro to Blackboard Collaborate
10:10 – Ice breaker session in groups
10:20 – TALK: Daniel Nüst – Research compendia enable code review during peer review (slides)
10:40 – TALK: Anna Krystalli – Tips and Tricks for Reproducing and Reviewing (slides)
11:00 – Select Papers
11:15 – Round I of ReproHacking (break-out rooms)
12:15 – Re-group and sharing of experiences
12:30 – LUNCH
13:30 – TALK: Daniel Piqué – How I discovered a missing data point in a paper with 8000+ citations
13:45 – Round II of ReproHacking (break-out rooms)
14:45 – COFFEE
15:00 – Round III of ReproHacking (break-out rooms) – Complete Feedback form
16:00 – Re-group and sharing of experiences
16:30 – TALK: Sarah Gibson – Sharing Reproducible Computational Environments with Binder (slides) (see also here for materials from a Binder Workshop)
16:45 – Feedback and Closing
We had about 15 minutes to decide which paper we would like to reproduce from a list that contained almost 50 papers! The group that I joined set out to reproduce the preprint by Eiko Fried et al. on mental health and social contact during the COVID-19 pandemic. Our group consisted of Linda Nab, one of the organisers of the ReproHack, Alessandro Gasparini (check out his work on INTEREST here if you work with simulations!), Anna Lohmann, Ciu, and myself. The first session was spent finding out how to download all the data and code from the Open Science Framework. After we had retrieved all the files, we had to download packages (or update R). During the second session we were able to do more actual reproducing rather than just getting set up. The work by Eiko Fried was well structured and documented, so after the initial setup problems the process of reproducing the work went quite smoothly. In the end, we managed to reproduce the majority of the paper!
In the third session, feedback was provided to the authors of the papers that were being reproduced, using the feedback form that the ReproHack team had set up. This form contained questions about which paper was chosen, whether the participants were able to reproduce the paper, and how much of the paper was reproduced. In more detail, we could describe which procedures, tools, operating system, and software we used to reproduce the paper and how familiar we were with these. We also had to rate the reusability of the material and indicate whether it had a licence. A very important section of the feedback form asked which challenges we ran into while trying to reproduce the paper, and what the positive features were. A separate section was dedicated to the documentation of the data and code, asking how well the material was documented. Additional suggestions and comments to improve the reproducibility were also welcomed.
After everyone returned from the last breakout sessions and filled in their feedback forms, the groups took turns discussing whether they were able to reproduce the papers that they had chosen and, if not, which challenges they faced. Most of the selected papers were reproduced by the participants. It was noted that proper documentation, such as README files, manuals, and comments in the scripts themselves explaining the correct operating instructions, was especially helpful in reproducing someone else's work.
Another way of improving the quality and reproducibility of research is to ask your colleagues to reproduce your findings and offer them a co-author position (see this paper by Reimer et al. (2019) for more details on the 'co-pilot system'). Some universities have dedicated services for checking code and data before they are published (see this service at Cornell University).
There are several tools available to check and clean your data:
- R packages such as assertr, pointblank, naniar, visdat, ExPanDaR, DataExplorer and validate (see here for the ‘validate’ paper and here for a tutorial).
- Python: awesome_eda
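To give a flavour of what these validation packages do: most of them let you declare a set of rules (value ranges, completeness, types) and then report every record that violates a rule, instead of stopping at the first failure. Below is a minimal, dependency-free Python sketch of that idea; the data, rule names, and `check_records` helper are hypothetical illustrations, not part of any of the packages listed above.

```python
# Minimal sketch of rule-based data checking, in the spirit of R packages
# such as `validate` or `assertr`: each rule is a named predicate applied
# to every record, and all violations are collected for reporting.
# The data and rule names below are made up for illustration.

def check_records(records, rules):
    """Return a list of (row_index, rule_name) for every violated rule."""
    violations = []
    for i, row in enumerate(records):
        for name, predicate in rules.items():
            try:
                ok = predicate(row)
            except (KeyError, TypeError):
                ok = False  # a missing field or wrong type counts as a violation
            if not ok:
                violations.append((i, name))
    return violations

# Two simple completeness/range rules, analogous to declarative checks
# like "age >= 0" or "score is not missing" in the R packages above.
rules = {
    "age_nonnegative": lambda r: r["age"] >= 0,
    "score_present": lambda r: r["score"] is not None,
}

data = [
    {"age": 34, "score": 7.5},
    {"age": -1, "score": 6.0},   # violates age_nonnegative
    {"age": 29, "score": None},  # violates score_present
]

print(check_records(data, rules))  # -> [(1, 'age_nonnegative'), (2, 'score_present')]
```

Collecting all violations at once, rather than raising on the first one, is what makes these tools useful for cleaning a dataset before publication: one run gives you the full list of problems to fix.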
If you would like to learn more about ReproHacks, the Dutch ReproHack team wrote a paper on the Dutch ReproHack in November 2019. If you would like to participate, organise your own ReproHack, or contribute to the ReproHack work, the ReproHack team invites contributions on GitHub.
Anna Krystalli provided the Remote ReproHack participants with some additional resources to improve the reproducibility of our own papers:
- The Turing Way: a lightly opinionated guide to reproducible data science.
- Statistical Analyses and Reproducible Research: introduction of the concept of Research Compendia.
- Packaging data analytical work reproducibly using R (and friends)
- How to Read a Research Compendium
- Reproducible Research in R with rrtools: create a research compendium around materials associated with a published paper (text, data and code) using the R package rrtools (see here for an example).