The practice of Significance Testing (ST) remains widespread in psychological science despite continual criticism of its flaws and abuses. Using simulation experiments, we address four concerns about ST and, for two of these, we compare ST’s performance with prominent alternatives. We find the following: First, the

The

Recently, the

The controversy over best statistical practices emerged, in part, from historical accident. The long-standing prominence of ST made it a salient object of critical discussion. Individual alternative methods are sometimes seen as gaining in credibility inasmuch as a particular shortcoming of ST is demonstrated. Direct comparisons between ST and alternatives are rare, as are comparisons between or among those alternatives.

Our approach is to pose four questions regarding inductive inference, and then to assess ST’s performance – where possible in direct comparison with an alternative. We first address each question at the conceptual level and then seek quantitative answers in simulation experiments. The four questions are: [1] How well does the

The

A key concern about the

We sought to quantify how much p(D|H) reveals about p(H|D). Bayes’ Theorem, which expresses the mathematical relationship between the two inverse conditional probabilities, provides the first clues. The theorem

$$p(H|D) = \frac{p(H)\,p(D|H)}{p(H)\,p(D|H) + p(\sim H)\,p(D|\sim H)}$$

shows that as p(D|H) decreases,

We studied the results for a variety of settings in simulation experiments (

We sampled values for p(H), p(D|H), and p(D|∼H) and varied the size of the negative correlation between p(D|H) and p(D|∼H), with the result of interest being the correlation between p(D|H) and p(H|D), that is, the correlation indicating the predictive power of
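The sampling scheme described here can be sketched in a few lines. This is a minimal illustration, not the authors' Matlab code: it assumes independent uniform draws for p(H), p(D|H), and p(D|∼H) (the manipulation of the correlation between the two likelihoods is left out), and the function names `posterior`, `pearson`, and `simulate` are hypothetical.

```python
import random


def posterior(p_h, p_d_h, p_d_not_h):
    """Bayes' theorem: p(H|D) from the prior p(H) and the two likelihoods."""
    numerator = p_h * p_d_h
    return numerator / (numerator + (1.0 - p_h) * p_d_not_h)


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_draws=20000, seed=1):
    """Correlation between p(D|H) and p(H|D) over randomly drawn settings.

    Assumption: all three inputs are independent and uniform on (0, 1).
    """
    rng = random.Random(seed)
    likelihoods, posteriors = [], []
    for _ in range(n_draws):
        p_h = rng.uniform(0.001, 0.999)
        p_d_h = rng.uniform(0.001, 0.999)
        p_d_not_h = rng.uniform(0.001, 0.999)
        likelihoods.append(p_d_h)
        posteriors.append(posterior(p_h, p_d_h, p_d_not_h))
    return pearson(likelihoods, posteriors)


r = simulate()
print(round(r, 3))
```

Because p(H|D) increases monotonically with p(D|H) when the other two inputs are held fixed, the resulting correlation is positive; its size depends on the assumed input distributions.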

Turning to the effect of researchers’ prior knowledge on the inductive power of

When raising the correlation between p(H) and p(D|H) to .5 and to .9, we respectively observe correlations of .628 and .891 between p(D|H) and p(H|D). This result suggests that as a research program matures, the

Consider research on the self-enhancement bias as another example of the use of ST in a mature research domain. After years of confirmatory findings, the researcher can predict that most respondents will regard themselves as above average when rating themselves and the average person on dimensions of personal importance (

A second concern about

Consider, like Jonathan Swift, two islands, one in which the Lilliputians are much shorter than the Blefuscudians, and another in which there is no difference. Sampling heights from the no-effect island produces a uniform distribution of
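The two-island contrast can be illustrated by simulation. The sketch below draws two groups of heights per study and computes a two-sided p value from a two-sample z test. The heights, the standard deviation, the group size, the effect of δ = .8, and the use of a z test with known σ are all illustrative assumptions, and `two_sample_p` and `share_below` are hypothetical helper names.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def two_sample_p(mean_a, mean_b, sigma, n, rng):
    """Two-sided p value of a two-sample z test (sigma assumed known)."""
    group_a = [rng.gauss(mean_a, sigma) for _ in range(n)]
    group_b = [rng.gauss(mean_b, sigma) for _ in range(n)]
    z = (fmean(group_a) - fmean(group_b)) / (sigma * (2.0 / n) ** 0.5)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def share_below(p_values, cutoff):
    """Proportion of p values that fall below a cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


rng = random.Random(7)
# No-effect island: both groups share the same mean height.
null_p = [two_sample_p(100.0, 100.0, 10.0, 20, rng) for _ in range(4000)]
# Effect island: means differ by 0.8 standard deviations (assumed value).
effect_p = [two_sample_p(100.0, 108.0, 10.0, 20, rng) for _ in range(4000)]

for cutoff in (0.05, 0.25, 0.50):
    print(cutoff, round(share_below(null_p, cutoff), 3),
          round(share_below(effect_p, cutoff), 3))
```

On the no-effect island, the share of p values below any cutoff is close to the cutoff itself, which is what a uniform distribution implies. On the effect island, small p values are heavily overrepresented.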

We ask whether the relationship between

We now quantify the concept of inductive value of the statistical

Recall that

Table

The “Inductive Value” of

| | Sure-thing environment | | | Uncertain environment | | | Risky environment | | |
|---|---|---|---|---|---|---|---|---|---|
| | δ = .2 | δ = .5 | δ = .8 | δ = .2 | δ = .5 | δ = .8 | δ = .2 | δ = .5 | δ = .8 |
| N = 20 | 0.768 | 0.973 | 0.990 | 0.378 | 0.814 | 0.933 | -0.019 | 0.122 | 0.336 |
| N = 50 | 0.924 | 0.990 | 0.992 | 0.671 | 0.948 | 0.983 | 0.133 | 0.543 | 0.775 |
| N = 100 | 0.971 | 0.992 | 0.993 | 0.831 | 0.980 | 0.991 | 0.356 | 0.811 | 0.924 |

The index of inductive value is similar to other efforts to place

Two elements of Bayes’ Theorem combine to form the likelihood ratio, LR. The numerator of the formula presented earlier contains p(D|H) and thus the

When the alternative hypothesis refers to a specific point, the LR may also be referred to as the Bayes factor, BF (e.g.,

The estimation of an LR requires a specific alternative hypothesis, ∼H, in addition to the null hypothesis, H. Having to make this selection explicit is thought to eliminate the illusion of scooping up a “free lunch” (

We now assume that a specific alternative hypothesis has been chosen, and that multiple experiments can be performed. From this perspective, we see the close correspondence between the

One way to show that the LR can capture variation that is ignored by the

The point of this illustration is that the same

We now return to the question of how the

We therefore conducted simulation experiments, in which we varied both p(D|H) and p(D|∼H). We set the null distribution to μ = 10 and σ = 5, and chose a series of mean values for ∼H (11, 12.5, 14, 15, 17.5, 20, 22.5, 30, and 40) to represent alternatives with a spread of effect sizes (δ = .2, .5, .8, 1, 1.5, 2, 2.5, and 4). Next, we chose the three

Likelihood Ratio to

| Simulation Parameters | δ | μ_{∼H} | z_{∼H} | p(D\|∼H) | LR |
|---|---|---|---|---|---|
| | 0.2 | 11 | 1.76 | 0.079 | 0.689 |
| H = 10 | 0.5 | 12.5 | 1.46 | 0.145 | 0.425 |
| z = 1.96 | 0.8 | 14 | 1.16 | 0.246 | 0.287 |
| | 1 | 15 | 0.96 | 0.337 | 0.232 |
| | 1.5 | 17.5 | 0.46 | 0.646 | 0.163 |
| | 2 | 20 | 0.04 | 0.968 | 0.147 |
| | 2.5 | 22.5 | 0.54 | 0.589 | 0.169 |
| | 4 | 30 | 2.04 | 0.042 | 1.174 |
| | 0.2 | 11 | 2.38 | 0.017 | 0.609 |
| H = 10 | 0.5 | 12.5 | 2.08 | 0.038 | 0.312 |
| z = 2.58 | 0.8 | 14 | 1.78 | 0.075 | 0.175 |
| | 1 | 15 | 1.58 | 0.114 | 0.125 |
| | 1.5 | 17.5 | 1.08 | 0.280 | 0.064 |
| | 2 | 20 | 0.58 | 0.562 | 0.042 |
| | 2.5 | 22.5 | 0.08 | 0.936 | 0.036 |
| | 4 | 30 | 1.42 | 0.156 | 0.098 |
| | 0.2 | 11 | 3.10 | 0.002 | 0.527 |
| H = 10 | 0.5 | 12.5 | 2.80 | 0.005 | 0.218 |
| z = 3.30 | 0.8 | 14 | 2.50 | 0.013 | 0.098 |
| | 1 | 15 | 2.30 | 0.022 | 0.061 |
| | 1.5 | 17.5 | 1.80 | 0.072 | 0.022 |
| | 2 | 20 | 1.30 | 0.194 | 0.010 |
| | 2.5 | 22.5 | 0.80 | 0.424 | 0.006 |
| | 4 | 30 | 0.70 | 0.484 | 0.006 |
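The tabled values can be approximated with a short script. This sketch rests on assumptions inferred from the numbers rather than stated in the text: the observed result's distance from ∼H is taken as |z − δ|, p(D|∼H) is the two-sided tail probability of that distance, and the LR is the ratio of the standard normal densities at the two z values. The helper name `lr_row` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

MU_H, SIGMA = 10.0, 5.0                        # null distribution from the text
ALT_MEANS = [11, 12.5, 14, 15, 17.5, 20, 22.5, 30]
Z_VALUES = [1.96, 2.58, 3.30]                  # observed z under H


def lr_row(z_h, delta):
    """Return (z under ~H, two-sided p under ~H, likelihood ratio).

    Assumption: the LR is the density of the observed z under H divided
    by the density of the same observation under ~H.
    """
    z_alt = abs(z_h - delta)
    p_alt = 2.0 * (1.0 - STD_NORMAL.cdf(z_alt))
    lr = STD_NORMAL.pdf(z_h) / STD_NORMAL.pdf(z_alt)
    return z_alt, p_alt, lr


for z_h in Z_VALUES:
    for mu_alt in ALT_MEANS:
        delta = (mu_alt - MU_H) / SIGMA
        z_alt, p_alt, lr = lr_row(z_h, delta)
        print(f"z={z_h:.2f} delta={delta:.1f} mean={mu_alt:>4} "
              f"z_alt={z_alt:.2f} p_alt={p_alt:.3f} LR={lr:.3f}")
```

Under these assumptions the script agrees with the table to within rounding. It also makes the non-monotonic pattern easy to see: for a fixed z, the LR is smallest when the alternative sits near the observed result and rises again once the alternative overshoots it.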

We have considered some low values for

To conclude this section, we observe that the relationship between the LR and the r_{XY} becomes more negative. For example, r_{X,X/Y} = .34 and .83, respectively, for r_{XY} = .5 and −.5. Small nonlinearities remain so that the best-fitting associations are even stronger. Figure r_{XY}. The values for r_{X,X/Y} remain positive even under the least favorable conditions (i.e., when X and Y are increasingly redundant). The reason why the correlations are less than perfect is simply the researcher’s ignorance of what the research hypothesis (∼H) might be. It is neither a specific prediction nor a default-diffuse one.

The correlation between a (log-transformed) ratio and its (log-transformed) numerator for different input correlations between numerator and denominator.
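Why a ratio stays positively correlated with its numerator can be seen in an idealized case. The derivation below is a sketch under assumptions that are not taken from the simulations: x = log X and y = log Y have equal variance σ² and correlate at ρ.

```latex
\begin{aligned}
\operatorname{Cov}(x,\, x - y) &= \sigma^2 - \rho\sigma^2 = \sigma^2 (1 - \rho), \\
\operatorname{Var}(x - y) &= \sigma^2 + \sigma^2 - 2\rho\sigma^2 = 2\sigma^2 (1 - \rho), \\
r_{x,\,x-y} &= \frac{\sigma^2 (1 - \rho)}{\sigma \cdot \sigma\sqrt{2(1 - \rho)}}
             = \sqrt{\frac{1 - \rho}{2}} .
\end{aligned}
```

In this idealized case the correlation is positive for every ρ < 1 and grows as ρ becomes more negative. The exact values need not coincide with the simulated correlations reported above, because the simulated inputs need not satisfy these assumptions.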

We have seen that the LR can improve inductive inferences if a well-reasoned alternative hypothesis is available. A researcher who wishes to estimate the posterior probability of the null hypothesis, p(H|D), is better served by knowing p(D|H) and p(D|∼H) than by knowing only the former. Yet, we also saw that the

One long-standing alternative to

The Open Science Collaboration (

ST and the CI approaches use different definitions of replication (see also

Inspection of Cumming’s CI criterion reveals potentially awkward patterns. The second mean might lie within the CI of the first mean but have a different sign. This would be consistent with the ST view that a null finding was replicated, but the CI approach does not refer to a null hypothesis. So what has been replicated? Another concern involves sample size. As

We sampled observations from a distribution with μ = 55 and σ = 10 (i.e., δ = .5 relative to the null distribution of μ = 50 and σ = 10), computed a 95% CI around each observed mean, and conducted a one-sample

CI and

SD of |
|||||
---|---|---|---|---|---|

10 | 14.13 | 14.04 | 3.24 | 0.2467 | 0.1455 |
20 | 9.33 | 9.24 | 2.31 | 0.1249 | 0.0381 |
30 | 7.41 | 7.39 | 1.85 | 0.0635 | 0.0114 |
40 | 6.32 | 6.28 | 1.61 | 0.0292 | 0.0025 |
50 | 5.68 | 5.67 | 1.35 | 0.0125 | 0.0010 |
60 | 5.13 | 5.10 | 1.27 | 0.0068 | 0.0002 |
70 | 4.76 | 4.75 | 1.21 | 0.0039 | 0.0001 |
80 | 4.45 | 4.45 | 1.08 | 0.0015 | 0.0000 |
90 | 4.20 | 4.18 | 1.10 | 0.0011 | 0.0000 |
100 | 3.97 | 3.96 | 0.97 | 0.0005 | 0.0000 |

We then estimated the probability of a successful replication using both the CI and the ST frameworks. Assuming a false null hypothesis (i.e., p(H) = 0), we simulated the probability with which the mean obtained in one simulated experiment would fall within the CI of another experiment. The results in Table ^{2}) approaches 1. Over this series of simulations, the median probability of replication is remarkably similar for both the CI (

Probability of replication with CI and NHST.

| N | Confidence Interval Approach: p(rep) | Confidence Interval Approach: SD p(rep) | NHST Approach: p(sign.) | NHST Approach: p(sign.^{2}) |
|---|---|---|---|---|
| 10 | 0.854 | 0.129 | 0.279 | 0.078 |
| 20 | 0.834 | 0.143 | 0.545 | 0.296 |
| 30 | 0.836 | 0.145 | 0.732 | 0.535 |
| 40 | 0.829 | 0.148 | 0.85 | 0.722 |
| 50 | 0.856 | 0.138 | 0.95 | 0.902 |
| 60 | 0.844 | 0.148 | 0.971 | 0.942 |
| 70 | 0.833 | 0.151 | 0.983 | 0.965 |
| 80 | 0.849 | 0.140 | 0.994 | 0.987 |
| 90 | 0.820 | 0.158 | 0.996 | 0.991 |
| 100 | 0.851 | 0.146 | 0.999 | 0.997 |

^{2}) = probability of finding significance (
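The contrast between the two replication criteria can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' Matlab code: the 95% CI uses the normal critical value instead of the t distribution, power is approximated with a one-sided normal formula, and `ci_capture_rate` and `approx_power` are hypothetical names.

```python
import random
from statistics import NormalDist, fmean, stdev

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # about 1.96


def ci_capture_rate(n, mu=55.0, sigma=10.0, reps=2000, seed=3):
    """Share of study pairs in which the second mean lies in the first CI.

    Assumption: the CI half-width is Z_CRIT * s / sqrt(n), a normal
    approximation to the t-based interval.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        first = [rng.gauss(mu, sigma) for _ in range(n)]
        second = [rng.gauss(mu, sigma) for _ in range(n)]
        half_width = Z_CRIT * stdev(first) / n ** 0.5
        if abs(fmean(second) - fmean(first)) <= half_width:
            hits += 1
    return hits / reps


def approx_power(n, delta=0.5):
    """Normal approximation to the power of a one-sample test at alpha = .05.

    The negligible rejection probability in the opposite tail is ignored.
    """
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - delta * n ** 0.5)


for n in (20, 50, 100):
    power = approx_power(n)
    # p(sign.^2): probability that two independent studies are both significant.
    print(n, round(ci_capture_rate(n), 3), round(power, 3), round(power ** 2, 3))
```

The capture rate stays near the same level at every sample size, whereas the probability of two significant results rises with N. At small N the normal approximation overstates the power of a t test, so the approximate values sit above the tabled ones there.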

If the replicability of research findings is in question, the CI measure ignores the power of large studies to repeatedly yield the same result. When

Significance testing, ST, is meant to support statistical inference under uncertainty. Like any method of inductive inference, ST faces many challenges, and it has been difficult to find a balanced evaluation of its strengths and weaknesses. Recently, diverse proposals have been made to reform statistical practice, such as lowering the threshold for statistical significance, adding alternative methods, and even abandoning ST altogether. Because some of the uncertainty raised by questions of induction is irreducible, it is necessary not only to explore the strengths and weaknesses of a particular method, but also to ask how that balance compares with the balance of strengths and weaknesses of other available methods. A comprehensive review of all methods along all possible criteria of validity is beyond the scope of any particular investigation. We therefore focused on four questions in the critical literature.

Using simulation experiments, we reproduced the statistical patterns associated with each concern. Then we showed that each concern is valid in the context of specific assumptions. By making assumptions more flexibly, we sought a broader evaluation of ST. In each of four areas of concern, we found that

To review, we first addressed the concern that

Second, we addressed the concern that

Third, we compared likelihood ratios with

Fourth, we addressed the claim that confidence intervals provide better estimates of the replicability of an empirical result. We find that CI overlaps are uniformly large and that this is not a useful feature for the estimation of replicability. The width of the CI for a particular sample mean is highly correlated with the variability of means over different sample sizes. Therefore, estimates of replicability performed with CIs are insensitive to statistical power. If the results of studies with high power are to be regarded as more predictive of replication than the results of studies with low power, ST should be preferred. The blind spot of ST lies elsewhere. When two studies yield significant results but very different effect size estimates, a CI analysis takes note, whereas ST does not. We urge researchers to pay close attention to effect sizes and CIs as well as

One of David Hume’s lasting legacies is to have shown that no method of inductive inference can be justified deductively (

We began this article with a note on how the debate over statistical analysis is often framed as an inquisition into the flaws of ST. The present article too is cast in the mold of this ongoing controversy. There have been many critical assessments (e.g.,

There is a lesson for future comparative efforts and intervention of institutional task forces. Instead of setting up ST as a defendant awaiting a verdict, it might be useful to articulate the mission of inductive inference and specific questions and challenges arising from it. We encourage careful consideration of when a statistical test might be necessary, and when estimation methods, unrestricted by dichotomania (

Matlab code for simulations [1], [2], and [4] can be found on Patrick Heck’s website

This does not mean that a low

More generally, it is extremity that is inversely related to probability. Catastrophes are rarer than mishaps much like great joy is rarer than a pleasant mood.

The argument presented in this section may also be stated as a probabilistic reverse inference. If there is an effect,

The same holds true for the upper-Bayes-factor-bound proposed by Bayarri et al. (

Simonsohn (

We thank Joe Austerweil, Dan Balliet, Dan Benjamin, Tony Evans, Florian Kutzner, and Jan Rummel for sharing their insightful ideas about NHST and its limitations.

The authors have no competing interests to declare.

JIK and PRH contributed equally to this work. JIK drafted the manuscript. PRH wrote the simulations and analyzed the simulated data. JIK and PRH revised the manuscript.