Design and Analysis of Replication Studies

A workshop by the Center of Reproducibility Science (CRS) in Zurich.

January 22 - 24, 2020

Venue: University of Zurich, Epidemiology, Biostatistics and Prevention Institute (EBPI), Hirschengraben 82, room HIT H 03. You will find this building by walking around Hirschengraben 84. Contact us anytime by email

About the workshop

The goal of this international workshop is a thorough methodological discussion regarding the design and the analysis of replication studies including specialists from different fields such as clinical research, psychology, economics and others.

All interested researchers are invited to participate. Limited seating available, please register below.

Program Info

22 Jan, 2020

12 : 45 - 13 : 00


By Leonhard Held, CRS Director

13 : 00 - 16 : 00

Tutorial on the R package ReplicationSuccess: Design of Replication Experiments

By Charlotte Micheloud, Samuel Pawel, Leonhard Held

Statistical power is of central importance in assessing the reliability of science. Appropriate design of a replication study is key to tackling the replication crisis, as many such studies are currently severely under-powered. The tutorial will describe standard and more advanced methods to calculate the required sample size of a replication study, taking into account the results of an original discovery study. Participants will learn how to use the R package ReplicationSuccess. Prerequisites include basic R knowledge and familiarity with concepts of statistical inference.
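As a rough illustration of the "standard" sample size calculation mentioned above, the following Python sketch computes the relative sample size c = n_replication / n_original needed to reach a target power. It assumes that the true effect equals the original estimate, that test statistics are normally distributed, and that the replication standard error equals the original one divided by sqrt(c). The function names are hypothetical and are not the interface of the ReplicationSuccess package.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def replication_power(z_orig, rel_sample_size, alpha=0.025):
    """Power of a one-sided level-alpha test in the replication study.

    Assumption: the true effect equals the original estimate, so the
    replication z-statistic is N(sqrt(c) * |z_orig|, 1).
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    return _STD_NORMAL.cdf(rel_sample_size ** 0.5 * abs(z_orig) - z_crit)


def relative_sample_size(z_orig, power=0.8, alpha=0.025):
    """Relative sample size c = n_rep / n_orig that achieves the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ((z_crit + z_power) / abs(z_orig)) ** 2


if __name__ == "__main__":
    # An original study that was only just significant (z = 1.96)
    c = relative_sample_size(1.96, power=0.8)
    print(f"relative sample size: {c:.2f}")
    print(f"power with equal sample size: {replication_power(1.96, 1.0):.2f}")
```

Under these assumptions, a replication of a just-significant original study with the same sample size has only about 50% power, which is one reason the original result needs to enter the design.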

23 Jan, 2020

09 : 00 - 09 : 05


By Leonhard Held, CRS Director

09 : 05 - 09 : 50

What should we "expect" from reproducibility?

By Stephen Senn, Consultant Statistician, Edinburgh

Is there really a reproducibility crisis and, if so, are P-values to blame? Choose any statistic you like and carry out two identical independent studies and report this statistic for each. In advance of collecting any data, you ought to expect that it is just as likely that statistic 1 will be smaller than statistic 2 as vice versa. Once you have seen statistic 1, things are not so simple, but if they are not so simple, it is because you have other information in some form. However, it is at least instructive that you need to be careful in jumping to conclusions about what to expect from reproducibility. Furthermore, the forecasts of good Bayesians ought to obey a Martingale property. On average you should be in the future where you are now but, of course, your inferential random walk may lead to some peregrination before it homes in on “the truth”. But you certainly can’t generally expect that a probability will get smaller as you continue. P-values, like other statistics, are a position, not a movement. Although often claimed, there is no such thing as a trend towards significance. Using these and other philosophical considerations, I shall try to establish what it is we want from reproducibility. I shall conclude that we statisticians should probably be paying more attention to checking that standard errors are being calculated appropriately and rather less to the inferential framework.

09 : 50 - 10 : 20

Replicability as generalizability: Revisiting external validity with specification curve analysis

By Johannes Ullrich, University of Zurich

Abstract coming soon


10 : 20 - 10 : 50

Coffee break

10 : 50 - 11 : 35

Direct and conceptual replications

By Anna Dreber, Stockholm School of Economics

Abstract coming soon

11 : 35 - 12 : 05

The role of replication studies in navigating epistemic landscapes

By Filip Melinščak, University of Zurich

There has been a great deal of debate on the role of replication studies in improving the reliability of science. Most of these debates have revolved around research that can be conceptualized as either testing null hypotheses or comparing multiple theoretical models. For these types of research, formal models describing the scientific process and illuminating the role of reproducibility within it have recently been proposed (McElreath & Smaldino 2015, Devezer et al. 2019). However, many domains of science - especially applied ones - investigate a rather different type of question. For example: “how to design an industrial process for maximal efficiency?”, “what is the optimal treatment plan for a given disease?”, “what educational policy would maximize student outcomes?”. We can fruitfully conceptualize such research programs as trying to empirically find optimal solutions to various problems, which has been likened to exploring “epistemic landscapes” in search of “peaks” (Weisberg & Muldoon, 2009). Here we investigate the role of replication studies - and experimental design, generally - in such optimization-centric research programs. Using agent-based modeling, we evaluate how efficiently different research strategies can find peaks in multidimensional epistemic landscapes.


12 : 05 - 13 : 30

Flying lunch

13 : 30 - 14 : 00

When is a Replication Successful? No p-values, please!

By Werner Stahel, Seminar für Statistik, Swiss Federal Institute of Technology, Zürich

When are the results of a replication study similar enough to those of the original to say that the original claim was confirmed or that the replication was successful? Often, this decision is based on statistical significance: The original found an effect to be significant, and the replication is called successful if it has again produced a significant estimate of the effect. It is desirable to devise measures of the degree of success, avoiding a simple dichotomy.

It is now widely propagated that null hypothesis testing be replaced by estimation with confidence intervals. In fact, science is not interested in tiny effects, and therefore, a scientific question asks if an effect is relevantly different from zero, and a threshold for "relevance" is needed. This has consequences for the interpretation of both an "original" study and its replication. Nevertheless, the literature still focusses on p-values, and even in this workshop, we hear about quite sophisticated methods using them.

The simplest measure of (dis-) similarity between an original and a replication study is the difference between effect sizes. Some care regarding standardization is needed to make sure that it is a parameter of the model that can be estimated. Based on an estimate of the similarity and of the effect in the replication, I propose a classification of results that characterizes the "success of replication".

In random effects meta-analysis, an important characteristic is the ratio of between-study and within-study variation. Surprisingly, the usual definition fails to be a parameter of the model. When an "original" study should be substantiated by replication, we should expect both a selective reporting bias in the original and a variability component between different potential replication studies. Since these two components cannot be separated on the basis of a single replication, more than one replication is usually needed, and a strategy for a "replication process" is required.

14 : 00 - 14 : 30

Experimental replications in animal trials

By Florian Frommlet, Medical University Vienna

The recent discussion on reproducibility of scientific results is particularly relevant for preclinical research with animal models. Within that research community there exists some tradition to repeat an experiment three times to demonstrate replicability. However, there are hardly any guidelines on how to plan such an experimental design or how to report the results obtained. This article provides a thorough statistical analysis of the 'three-times' rule as it is currently often applied in practice and gives some recommendations on how to improve the study design and statistical analysis of replicated animal experiments.

14 : 30 - 15 : 00

Identifying boundary conditions in confirmatory preclinical animal studies to increase value and foster translation

By Meggie Danziger, QUEST Center for Transforming Biomedical Research at the Berlin Institute of Health

Background: Low statistical power in preclinical animal experiments has been repeatedly pointed out as a roadblock to successful replication and translation. If only a small number of tested interventions is effective (i.e., low pre-study odds), researchers should increase the power of their experiments to detect those true effects. This, however, conflicts with ethical and budget constraints. To increase the scientific value of preclinical experiments under these constraints, it is necessary to devise strategies that result in maximally efficient confirmatory studies.

Methods: To this end, we explore different approaches to perform preclinical animal experiments via simulations. We model the preclinical research trajectory from the exploratory stage to the results of a within-lab confirmatory study. Critically, we employ different decision criteria that indicate when one should move from the exploratory stage to the confirmatory stage as well as various approaches to determine the sample size for a confirmatory study (smallest effect size of interest (SESOI), safeguard, and standard power analysis). At the confirmatory stage, different experimental designs (fixed-N and sequential with and without futility criterion) and types of analyses (two sample t-test and Bayes factor) are explored. The different trajectories of the research chain are compared regarding the number of experiments proceeding to the confirmatory stage, number of animals needed, positive predictive value (PPV), and statistical power.
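The link between pre-study odds, power and positive predictive value (PPV) referred to in this abstract can be illustrated with the textbook formula PPV = power * R / (power * R + alpha), where R denotes the pre-study odds of a true effect. The Python sketch below assumes no bias and a single test per hypothesis. It is not the simulation model used by the authors, and the function name is hypothetical.

```python
def positive_predictive_value(pre_study_odds, power, alpha=0.05):
    """Probability that a significant finding reflects a true effect.

    pre_study_odds: odds (true : null) that a tested intervention is effective
    power:          probability of a significant result if the effect is true
    alpha:          probability of a significant result if the effect is null
    """
    true_positives = power * pre_study_odds  # relative rate of true positives
    false_positives = alpha                  # relative rate of false positives
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    odds = 0.1 / 0.9  # 10% of tested interventions are truly effective
    for power in (0.2, 0.5, 0.8):
        ppv = positive_predictive_value(odds, power)
        print(f"power = {power:.1f} -> PPV = {ppv:.2f}")
```

With low pre-study odds, raising power from 0.2 to 0.8 roughly doubles the PPV in this example, which illustrates the tension between evidential value and animal numbers described in the background.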


15 : 00 - 15 : 30

Coffee break

15 : 30 - 16 : 15

Evaluating statistical evidence in biomedical research, meta-studies, and radical randomization

By Don van Ravenzwaaij, University of Groningen

For the endorsement of new medications, the US Food and Drug Administration requires replication of the main effect in randomized clinical trials. Typically, this replication comes down to observing two trials, each with a p-value below 0.05. In the first part of this talk, I discuss work from a simulation study (van Ravenzwaaij & Ioannidis, 2017) that shows what it means to have exactly two trials with a p-value below 0.05 in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases where two trials have a p-value below 0.05 have wildly differing Bayes factors. In a non-trivial number of cases, evidence actually points to the null hypothesis. We recommend use of Bayes factors as a routine tool to assess endorsement of new medications, because Bayes factors consistently quantify strength of evidence. In the second part of this talk, I will propose a different way to go about replication: the use of meta-studies with radical randomization (Baribault et al., 2018).

16 : 15 - 17 : 00

The harmonic mean chi-squared test to substantiate scientific findings

By Leonhard Held, University of Zurich

A new significance test is proposed to substantiate scientific findings from multiple primary studies investigating the same research hypothesis. The test statistic is based on the harmonic mean of the squared study-specific test statistics and can also include weights. Appropriate scaling ensures that, for any number of studies, the null distribution is a chi-squared distribution with one degree of freedom. The null distribution can be used to compute a one-sided p-value or to ensure Type-I error control at a pre-specified level. Further properties are discussed and a comparison with FDA's two-trials rule for drug approval is made, as well as with alternative research synthesis methods. As a by-product, the approach provides a calibration of the sceptical p-value recently proposed for the analysis of replication studies.
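A minimal Python sketch of the unweighted version of such a statistic follows. It assumes study-specific z-values z_1, ..., z_n and uses the scaling X^2 = n^2 / sum(1 / z_i^2), that is, n times the harmonic mean of the squared z-values. The one-sided p-value below additionally assumes that all estimates point in the hypothesized (positive) direction and divides the chi-squared tail probability by 2^n, which follows from the independence of signs and magnitudes under the null. The weighted version mentioned in the abstract is not covered, and the function names are hypothetical.

```python
import math


def harmonic_mean_chisq(z_values):
    """Scaled harmonic mean of squared z-values: n^2 / sum(1 / z_i^2).

    Under the null (independent standard normal z_i) this is chi-squared
    distributed with one degree of freedom. All z_i must be non-zero.
    """
    n = len(z_values)
    return n ** 2 / sum(1.0 / z ** 2 for z in z_values)


def chisq1_tail(x):
    """P(X >= x) for a chi-squared variable X with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def one_sided_p_value(z_values):
    """One-sided p-value; this sketch only handles all-positive z-values."""
    if any(z <= 0 for z in z_values):
        raise ValueError("sketch assumes all z-values are positive")
    return chisq1_tail(harmonic_mean_chisq(z_values)) / 2 ** len(z_values)


if __name__ == "__main__":
    z = [2.0, 2.0]
    print(harmonic_mean_chisq(z))  # 4 / (1/4 + 1/4) = 8.0
    print(one_sided_p_value(z))
```

For comparison, the two-trials rule with a one-sided level of 0.025 per trial has an overall Type-I error rate of 0.025^2 = 0.000625.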


19 : 00 - 22 : 00

Conference dinner

Location to be announced

24 Jan, 2020

09 : 00 - 09 : 45

The Replication Bayes factor and Beyond

By E.J. Wagenmakers, University of Amsterdam

Abstract coming soon

09 : 45 - 10 : 15

The sufficiently skeptical intrinsic prior

By Guido Consonni, Università Cattolica del Sacro Cuore, Milan

Abstract coming soon


10 : 15 - 10 : 45

Coffee break

10 : 45 - 11 : 15

A novel approach to meta-analysis testing under heterogeneity

By Judith ter Schure, PhD student, Centrum Wiskunde & Informatica Amsterdam

Scientific knowledge accumulates and therefore always has a (partly) sequential nature. As a result, the exchangeability assumption in conventional meta-analysis cannot be met if the existence of a replication (or, more generally, of later studies in a series) depends on earlier results. Such dependencies arise at the study level but also at the meta-analysis level, if new studies are informed by a systematic review of existing results in order to reduce research waste. Fortunately, study series with such dependencies can be meta-analyzed with Safe Tests. These tests preserve type I error control, even if the analysis is updated after each new study. Moreover, they introduce a novel approach to handling heterogeneity, a bottleneck in sequential meta-analysis. This strength of Safe Tests for composite null hypotheses lies in controlling type I errors over the entire set of null distributions by specifying the test statistic for a worst-case prior on the null. If for each study such a (study-specific) test statistic is provided, the combined test controls type I error even if each study is generated by a different null distribution. These properties are optimized in so-called GROW Safe Tests. Hence, they optimize the ability to reject the null hypothesis and make intermediate decisions in a growing series, without the need to model heterogeneity.

11 : 15 - 12 : 00

Efficient designs under uncertainty: Guarantee compelling evidence with Sequential Bayes Factor designs

By Felix Schönbrodt, Ludwig-Maximilians-Universität München

Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers’ intuition to collect more data to push an indecisive result into a decisive region. Optionally increasing the sample size is one of the most common "questionable research practices" (John, Loewenstein, & Prelec, 2012). In my talk, I will present the "Sequential Bayes Factor" (SBF) design, which allows unlimited multiple testing for the presence or absence of an effect, even after each participant. Sampling is stopped as soon as a pre-defined evidential threshold for H1 support or for H0 support is exceeded. Compared to an optimal NHST design, this leads on average to 50-70% smaller sample sizes, while having the same error rates. Furthermore, in contrast to NHST, its success is not dependent on a priori guesses of the true effect size. Finally, I give a quick overview of a priori Bayes factor design analysis (BFDA), which allows one to envisage the expected sample size (given an assumed true effect size) and to set a reasonable maximal sample size for the sequential procedure.
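The stopping rule of a sequential Bayes factor design can be sketched in Python as follows. To keep the example self-contained, it uses a deliberately simple model that is an assumption of this sketch and not necessarily the one used in the talk: observations are normal with known unit variance, H0 fixes the mean at zero, and H1 places a zero-mean normal prior with standard deviation prior_sd on the mean. All function and parameter names are hypothetical.

```python
import math
import random


def bf10_normal(sample_mean, n, prior_sd=1.0):
    """Bayes factor for H1: mu ~ N(0, prior_sd^2) against H0: mu = 0.

    Assumes n observations from N(mu, 1) with known unit variance.
    """
    k = n * prior_sd ** 2
    z_squared = n * sample_mean ** 2
    return math.exp(0.5 * z_squared * k / (1.0 + k)) / math.sqrt(1.0 + k)


def sequential_bf(true_effect, threshold=10.0, n_min=10, n_max=1000,
                  prior_sd=1.0, rng=None):
    """Sample one observation at a time until the evidence is compelling.

    Stops when BF10 >= threshold (support for H1), BF10 <= 1 / threshold
    (support for H0), or n_max is reached. Returns (n, BF10).
    Assumes 1 <= n_min <= n_max.
    """
    rng = rng or random.Random()
    total = 0.0
    bf = 1.0
    for n in range(1, n_max + 1):
        total += rng.gauss(true_effect, 1.0)
        if n < n_min:
            continue  # do not evaluate the Bayes factor on very small samples
        bf = bf10_normal(total / n, n, prior_sd)
        if bf >= threshold or bf <= 1.0 / threshold:
            break
    return n, bf


if __name__ == "__main__":
    rng = random.Random(2020)
    stops = [sequential_bf(0.5, rng=rng)[0] for _ in range(1000)]
    print("average stopping n:", sum(stops) / len(stops))
```

Repeating such simulations over a range of assumed true effects gives the distribution of stopping sample sizes, which is the kind of quantity a Bayes factor design analysis examines.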


12 : 00 - 13 : 30

Flying lunch

13 : 30 - 14 : 15

Shrinkage for reproducible research

By E.W. van Zwet, Leiden University Medical Center

The pressure to publish or perish undoubtedly leads to the publication of much poor research. However, the fact that significant effects tend to be smaller and less significant upon attempts to reproduce them, is also due to selection bias. I will discuss this "winner's curse" in some detail and show that it is largest in low powered studies. To correct for it, it is necessary to apply some shrinkage. To determine the appropriate amount of shrinkage, I propose to embed the study of interest into a large area of research, and then to estimate the distribution of effect sizes across that area of research. Using this estimated distribution as a prior, Bayes' rule provides the amount of shrinkage that is well-calibrated to the chosen area of research. I demonstrate the approach with data from the OSC project on reproducibility in psychology, and with data from 100 phase 3 clinical trials.
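The shrinkage step can be illustrated in Python with the simplest conjugate case. The sketch assumes that the estimated distribution of effects in the research area is a zero-mean normal with standard deviation prior_sd and that the study estimate is normal with known standard error. The talk estimates the prior from data and does not necessarily use this form. The function name is hypothetical.

```python
def shrunken_estimate(estimate, se, prior_sd):
    """Posterior mean and standard deviation under a N(0, prior_sd^2) prior.

    The shrinkage factor prior_sd^2 / (prior_sd^2 + se^2) lies between 0 and 1
    and is smaller (more shrinkage) for noisy, low-powered studies.
    """
    shrinkage = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    posterior_mean = shrinkage * estimate
    posterior_sd = (shrinkage * se ** 2) ** 0.5
    return posterior_mean, posterior_sd


if __name__ == "__main__":
    # Same estimate, different precision: the noisier study is shrunk more.
    print(shrunken_estimate(1.0, se=0.5, prior_sd=0.5))
    print(shrunken_estimate(1.0, se=0.1, prior_sd=0.5))
```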

14 : 15 - 14 : 45

Probabilistic forecasting of replication studies

By Samuel Pawel, University of Zurich

Throughout the last decade, the so-called replication crisis has stimulated many researchers to conduct large-scale replication projects. With data from four of these projects, we computed probabilistic forecasts of the replication outcomes, which we then evaluated regarding discrimination, calibration and sharpness. A novel model, which can take into account both inflation and heterogeneity of effects, was used and predicted the effect estimate of the replication study with good performance in two of the four data sets. In the other two data sets, predictive performance was still substantially improved compared to the naive model which does not consider inflation and heterogeneity of effects. The results suggest that many of the estimates from the original studies were too optimistic, possibly caused by publication bias or questionable research practices, and also that some degree of heterogeneity should be expected. Moreover, the results indicate that the use of statistical significance as the only criterion for replication success may be questionable, since from a predictive viewpoint, non-significant replication results are often compatible with significant results from the original study.
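One common way to formalize a naive forecast without inflation or heterogeneity, assumed here for illustration, is a normal predictive distribution for the replication estimate, centered at the original estimate with variance equal to the sum of the two squared standard errors. The Python sketch below computes the corresponding prediction interval. It is not the novel model described in the abstract, and the function name is hypothetical.

```python
from statistics import NormalDist


def prediction_interval(est_orig, se_orig, se_rep, level=0.95):
    """Prediction interval for the replication effect estimate.

    Assumes the replication estimate is N(est_orig, se_orig^2 + se_rep^2),
    i.e. no inflation of the original estimate and no heterogeneity.
    """
    pred_sd = (se_orig ** 2 + se_rep ** 2) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return est_orig - z * pred_sd, est_orig + z * pred_sd


if __name__ == "__main__":
    lower, upper = prediction_interval(est_orig=0.5, se_orig=0.3, se_rep=0.4)
    print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")
```

An interval that includes zero shows how a non-significant replication estimate can still be compatible with the original result from a predictive viewpoint.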


14 : 45 - 15 : 15

Coffee break

15 : 15 - 16 : 00


By Robert Matthews, Aston University Birmingham

Abstract coming soon


16 : 00 - 16 : 30

Final discussion

Registration is open

Please register by December 15, 2019


Limited seating available.

Practical Information

Suggestions for hotels


Hotel St. Josef
Hirschengraben 64


Hotel Marta
Zähringerstrasse 36


Hotel Kafischnaps
Kornhausstrasse 57


Hotel Bristol
Stampfenbachstrasse 34

Contact information


EBPI, Hirschengraben 84

Archive earlier events

Please find here information about previous events of the CRS