Due to its probabilistic nature, Null Hypothesis Significance Testing (NHST) is subject to decision errors. The concern for false positives has overshadowed the concern for false negatives in the recent debates in psychology. This might be unwarranted, since reported statistically nonsignificant findings may just be ‘too good to be false’. We examined evidence for false negatives in nonsignificant results in three different ways. We adapted the Fisher test to detect the presence of at least one false negative in a set of statistically nonsignificant results. Simulations show that the adapted Fisher method generally is a powerful method to detect false negatives. We examined evidence for false negatives in the psychology literature in three applications of the adapted Fisher method. These applications indicate that (i) the observed effect size distribution of nonsignificant effects exceeds the expected distribution assuming a null-effect, and approximately two out of three (66.7%) psychology articles reporting nonsignificant results contain evidence for at least one false negative, (ii) nonsignificant results on gender effects contain evidence of true nonzero effects, and (iii) the statistically nonsignificant replications from the Reproducibility Project Psychology (RPP) do not warrant strong conclusions about the absence or presence of true zero effects underlying these nonsignificant results. We conclude that false negatives deserve more attention in the current debate on statistical practices in psychology. Potentially neglecting effects due to a lack of statistical power can lead to a waste of research resources and stifle the scientific discovery process.

Popper’s (

Null Hypothesis Significance Testing (NHST) is the most prevalent paradigm for statistical hypothesis testing in the social sciences. In NHST a null hypothesis H_{0} is tested, where H_{0} most often regards the absence of an effect. If deemed false, an alternative, mutually exclusive hypothesis H_{1} is accepted. These decisions are based on the p-value: the probability of obtaining a result at least as extreme as the one observed, given that H_{0} is true. If the p-value falls below the significance threshold α, H_{0} is rejected and H_{1} is accepted.

Table 1 summarizes the four possible outcomes of such a decision. If H_{0} is true in the population, but H_{1} is accepted (‘H_{1}’), a Type I error is made (α); a false positive. If H_{1} is true in the population and H_{0} is accepted (‘H_{0}’), a Type II error is made (β); a false negative. If H_{0} is true in the population and H_{0} is accepted (‘H_{0}’), this is a true negative (upper left cell; 1 − α). Finally, if H_{1} is true in the population and H_{1} is accepted (‘H_{1}’), this is a true positive (lower right cell). The probability of finding a statistically significant result if H_{1} is true is the power (1 − β).

Summary table of possible NHST results. Columns indicate the true situation in the population, rows indicate the decision based on a statistical test. The true positive probability is also called power and sensitivity, whereas the true negative rate is also called specificity.

Decision | H_{0} true | H_{1} true |
---|---|---|
‘H_{0}’ | True negative (1 − α) | Type II error, false negative (β) |
‘H_{1}’ | Type I error, false positive (α) | True positive, power (1 − β) |

Unfortunately, NHST has led to many misconceptions and misinterpretations (e.g.,

Recent debate about false positives has received much attention in science and psychological science in particular. The Reproducibility Project Psychology (RPP), which replicated 100 effects reported in prominent psychology journals in 2008, found that only 36% of these effects were statistically significant in the replication (

The debate about false positives is driven by the current overemphasis on statistical significance of research results (

The concern for false positives has overshadowed the concern for false negatives in the recent debate, which seems unwarranted. Cohen (

The research objective of the current paper is to examine evidence for false negative results in the psychology literature. To this end, we inspected a large number of nonsignificant results from eight flagship psychology journals. First, we compared the observed effect distributions of nonsignificant results for eight journals (combined and separately) to the expected null distribution based on simulations, where a discrepancy between observed and expected distribution was anticipated (i.e., presence of false negatives). Second, we propose to use the Fisher test to test the hypothesis that H_{0} is true for all nonsignificant results reported in a paper, which we show to have high power to detect false negatives in a simulation study. Third, we applied the Fisher test to the nonsignificant results in 14,765 psychology papers from these eight flagship psychology journals to inspect how many papers show evidence of at least one false negative result. Fourth, we examined evidence of false negatives in reported gender effects. Gender effects are particularly interesting, because gender is typically a control variable and not the primary focus of studies. Hence we expect little reporting or publication bias in these results.

We begin by reviewing the probability density function of both an individual p-value and a set of p-values under H_{0}. We also propose an adapted Fisher method to test whether nonsignificant results deviate from H_{0} within a paper. These methods will be used to test whether there is evidence for false negatives in the psychology literature.

The distribution of one p-value

Considering that the present paper focuses on false negatives, we primarily examine nonsignificant p-values, that is, p-values larger than the significance threshold α (here .05). Under H_{0}, these nonsignificant p-values are uniformly distributed on the interval (α, 1] and can be rescaled to the unit interval as

p*_{i} = (p_{i} − α) / (1 − α),   (1)

where p_{i} is the i-th observed nonsignificant p-value and p*_{i} its rescaled value.

We applied the Fisher test to inspect whether the distribution of observed nonsignificant p-values deviates from the uniform distribution expected under H_{0}. The Fisher test was initially introduced as a meta-analytic technique to synthesize results across studies. The adapted Fisher test detects deviations from H_{0} in a set of k nonsignificant p-values, and is computed as

χ² = −2 Σ_{i=1}^{k} ln(p*_{i}),   (2)

where the χ² statistic has 2k degrees of freedom, with k the number of nonsignificant p-values in the set. A larger χ² value indicates more evidence for at least one false negative in the set of k results.
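Equations 1 and 2 can be implemented compactly. The sketch below is our own illustration, not the authors' original code; because the degrees of freedom 2k are even, the χ² survival function has a closed form and no statistics library is needed:

```python
import math

def adapted_fisher(p_values, alpha=0.05):
    """Adapted Fisher test on statistically nonsignificant p-values.

    Rescales each nonsignificant p-value to the unit interval,
    p* = (p - alpha) / (1 - alpha), and combines them as
    chi2 = -2 * sum(ln p*), which has 2k degrees of freedom under H0.
    Returns (chi2, p_fisher).
    """
    nonsig = [p for p in p_values if p > alpha]   # keep nonsignificant results
    k = len(nonsig)
    chi2 = -2.0 * sum(math.log((p - alpha) / (1.0 - alpha)) for p in nonsig)
    # Survival function of a chi-square with even df = 2k (closed form):
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = chi2 / 2.0
    p_fisher = math.exp(-half) * sum(half**i / math.factorial(i)
                                     for i in range(k))
    return chi2, p_fisher
```

For example, three nonsignificant p-values of .06, .50, and .90 yield χ² ≈ 10.82 on 6 degrees of freedom with a Fisher p of about .094, just below the α = .10 level used for the Fisher tests later in the paper.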

We estimated the power of detecting false negatives with the Fisher test as a function of sample size, true effect size, and the number of nonsignificant results k per paper.

Table 2 shows the power of the Fisher test for various combinations of effect size and number of nonsignificant results.

Power of the Fisher test to detect false negatives for small and medium effect sizes.

0.151 | 0.211 | 0.341 | 0.575 | 0.852 | 0.983 | |

0.175 | 0.267 | 0.459 | 0.779 | 0.978 | 1 | |

0.201 | 0.317 | 0.572 | 0.894 | 1 | 1 | |

0.208 | 0.352 | 0.659 | 0.948 | 1 | 1 | |

0.229 | 0.390 | 0.719 | 0.975 | 1 | 1 | |

0.251 | 0.434 | 0.784 | 0.990 | 1 | 1 | |

0.259 | 0.471 | 0.834 | 0.995 | 1 | 1 | |

0.280 | 0.514 | 0.871 | 0.998 | 1 | 1 | |

0.298 | 0.530 | 0.895 | 1 | 1 | 1 | |

0.304 | 0.570 | 0.918 | 1 | 1 | 1 | |

0.362 | 0.691 | 0.980 | 1 | 1 | 1 | |

0.429 | 0.780 | 0.996 | 1 | 1 | 1 | |

0.490 | 0.852 | 1 | 1 | 1 | 1 | |

0.531 | 0.894 | 1 | 1 | 1 | 1 | |

0.578 | 0.930 | 1 | 1 | 1 | 1 | |

0.621 | 0.953 | 1 | 1 | 1 | 1 | |

0.654 | 0.966 | 1 | 1 | 1 | 1 | |

0.686 | 0.976 | 1 | 1 | 1 | 1 |

To put the power of the Fisher test into perspective, we can compare its power to reject the null based on one statistically nonsignificant result (k = 1).

To show that statistically nonsignificant results do not warrant the interpretation that there is truly no effect, we analyzed statistically nonsignificant results from eight major psychology journals. First, we investigate if and how much the distribution of reported nonsignificant effect sizes deviates from the distribution expected if there were truly no effect (i.e., under H_{0}). Second, we investigate how many research articles report nonsignificant results and how many of those show evidence for at least one false negative using the Fisher test (

We did not use APA style χ² results in our analyses, because effect sizes based on these results are not readily mapped onto the correlation scale. Two erroneously reported test statistics were eliminated, such that these did not confound results.

Summary table of articles downloaded per journal, their mean number of results, and proportion of (non)significant results. Statistical significance was determined using α = .05.

Journal (Acronym) | Time frame | Results | Mean results per article | Significant (%) | Nonsignificant (%) |
---|---|---|---|---|---|
Developmental Psychology (DP) | 1985–2013 | 30,920 | 13.5 | 24,584 (79.5%) | 6,336 (20.5%) |
Frontiers in Psychology (FP) | 2010–2013 | 9,172 | 14.9 | 6,595 (71.9%) | 2,577 (28.1%) |
Journal of Applied Psychology (JAP) | 1985–2013 | 11,240 | 9.1 | 8,455 (75.2%) | 2,785 (24.8%) |
Journal of Consulting and Clinical Psychology (JCCP) | 1985–2013 | 20,083 | 9.8 | 15,672 (78.0%) | 4,411 (22.0%) |
Journal of Experimental Psychology: General (JEPG) | 1985–2013 | 17,283 | 22.4 | 12,706 (73.5%) | 4,577 (26.5%) |
Journal of Personality and Social Psychology (JPSP) | 1985–2013 | 91,791 | 22.5 | 69,836 (76.1%) | 21,955 (23.9%) |
Public Library of Science (PLOS) | 2003–2013 | 28,561 | 13.2 | 19,696 (69.0%) | 8,865 (31.0%) |
Psychological Science (PS) | 2003–2013 | 14,032 | 9 | 10,943 (78.0%) | 3,089 (22.0%) |

The analyses reported in this paper use the recalculated p-values, not the reported ones.

First, we compared the observed nonsignificant effect size distribution (computed from the observed test results) to the nonsignificant effect size distribution expected under H_{0}. The expected effect size distribution under H_{0} was approximated using simulation. We first randomly drew an observed test result (with replacement) and subsequently drew a random nonsignificant p-value between .05 and 1 (i.e., nonsignificance under H_{0}). Based on the drawn p-value and the degrees of freedom of the drawn test result, we then computed the accompanying effect size. This procedure assumes independence of test results in the same paper. We inspected this possible dependency with the intra-class correlation (ICC).
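This simulation can be sketched in Python. For a self-contained illustration we assume z-statistics, so that a two-sided p-value can be inverted by bisection on the normal CDF and converted to an effect size via r ≈ z/√N; the paper instead uses each drawn result's own test statistic and degrees of freedom, and the sample sizes and number of draws below are illustrative:

```python
import math
import random

def z_from_p(p):
    """Invert a two-sided normal p-value to |z| by bisection."""
    lo, hi = 0.0, 40.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        p_mid = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))))
        if p_mid > p:   # p-value still too large -> need a larger |z|
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def expected_effect_sizes(sample_sizes, draws=10_000, alpha=0.05, seed=42):
    """Approximate the expected nonsignificant effect size distribution
    under H0: resample an observed study (with replacement), draw a
    nonsignificant p-value uniformly from (alpha, 1), convert to r."""
    rng = random.Random(seed)
    effects = []
    for _ in range(draws):
        n = rng.choice(sample_sizes)        # resample an observed study
        p = rng.uniform(alpha, 1.0)         # nonsignificant p under H0
        effects.append(z_from_p(p) / math.sqrt(n))  # effect on r scale
    return effects
```

Binning these simulated effects yields the expected distribution against which the observed nonsignificant effects are compared.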

Second, we applied the Fisher test to test how many research papers show evidence of at least one false negative statistical result. To recapitulate, the Fisher test tests whether the distribution of observed nonsignificant p-values deviates from the uniform distribution expected under H_{0}. In order to compute the result of the Fisher test, we applied equations 1 and 2 to the recalculated nonsignificant p-values in each paper.

Figure 1 shows the density of the observed effect sizes.

Density of observed effect sizes of results reported in eight psychology journals, with 7% of effects in the category none-small, 23% small-medium, 27% medium-large, and 42% beyond large.

Our dataset indicated that more nonsignificant results are reported throughout the years, strengthening the case for inspecting potential false negatives. The proportion of reported nonsignificant results showed an upward trend, as depicted in Figure 2.

Observed proportion of nonsignificant test results per year.

For the entire set of nonsignificant results across journals, Figure 3 shows the observed and expected effect size distributions. Under H_{0}, 46% of all observed effects is expected to be within the range 0 ≤ |r| < .1.

Observed and expected (adjusted and unadjusted) effect size distribution for statistically nonsignificant APA results reported in eight psychology journals. Grey lines depict expected values; black lines depict observed values. The three vertical dotted lines correspond to a small, medium, large effect, respectively. Header includes Kolmogorov-Smirnov test results.

Because effect sizes and their distribution typically overestimate the population effect size η², particularly when sample size is small, we also compared the observed distribution to an expected distribution adjusted for this overestimation (the adjusted expected distribution in Figure 3), under which only 22% is expected in this range.

The Fisher test was applied to the nonsignificant test results of each of the 14,765 papers separately, to inspect for evidence of false negatives. More technically, we inspected whether the nonsignificant p-values within a paper deviate from the uniform distribution expected under H_{0} (i.e., uniformity). If H_{0} is in fact true for all results, we would find evidence for false negatives in 10% of the papers (a meta-false positive, given the α level of .10 used for the Fisher test). Table 4 summarizes the results.

Summary table of Fisher test results applied to the nonsignificant results (

Overall | DP | FP | JAP | JCCP | JEPG | JPSP | PLOS | PS | ||
---|---|---|---|---|---|---|---|---|---|---|

Nr. of papers | 14,765 | 2,283 | 614 | 1,239 | 2,039 | 772 | 4,087 | 2,166 | 1,565 | |

Count | 4,340 | 758 | 133 | 488 | 907 | 122 | 840 | 565 | 527 | |

% | 29.4% | 33.2% | 21.7% | 39.4% | 44.5% | 15.8% | 20.6% | 26.1% | 33.7% | |

Evidence FN | 57.7% | 66.1% | 41.2% | 48.7% | 58.7% | 51.4% | 66.0% | 47.2% | 56.4% | |

Count | 2,510 | 433 | 102 | 238 | 380 | 109 | 556 | 339 | 353 | |

Evidence FN | 60.6% | 66.9% | 50.0% | 36.3% | 57.7% | 66.7% | 75.2% | 51.6% | 57.1% | |

Count | 1,768 | 293 | 64 | 157 | 227 | 81 | 424 | 289 | 233 | |

Evidence FN | 65.3% | 69.8% | 57.6% | 53.1% | 54.4% | 77.1% | 80.6% | 47.8% | 60.2% | |

Count | 1,257 | 199 | 66 | 98 | 125 | 83 | 341 | 184 | 161 | |

Evidence FN | 68.7% | 75.0% | 63.8% | 53.1% | 69.7% | 67.9% | 81.4% | 52.7% | 62.5% | |

Count | 892 | 128 | 47 | 64 | 89 | 56 | 264 | 148 | 96 | |

5 ≤ |
Evidence FN | 72.3% | 71.2% | 67.7% | 56.7% | 66.3% | 71.2% | 87.1% | 52.4% | 63.0% |

Count | 2,394 | 326 | 124 | 134 | 208 | 163 | 898 | 368 | 173 | |

10 ≤ |
Evidence FN | 77.7% | 76.9% | 67.7% | 60.0% | 72.4% | 81.2% | 88.1% | 57.3% | 81.0% |

Count | 1,280 | 121 | 65 | 55 | 87 | 117 | 596 | 218 | 21 | |

Evidence FN | 84.0% | 76.0% | 53.8% | 60.0% | 87.5% | 80.5% | 94.0% | 69.1% | 0.0% | |

Count | 324 | 25 | 13 | 5 | 16 | 41 | 168 | 55 | 1 | |

All | Evidence FN | 47.1% | 46.5% | 45.1% | 29.9% | 34.3% | 59.1% | 64.6% | 38.4% | 39.3% |

Evidence FN |
66.7% | 69.6% | 57.6% | 49.4% | 61.7% | 70.2% | 81.3% | 51.9% | 59.2% | |

Count | 6,951 | 1,061 | 277 | 371 | 699 | 456 | 2,641 | 831 | 615 |

Table

As would be expected, we found a higher proportion of articles with evidence of at least one false negative for higher numbers of statistically nonsignificant results (

We also checked whether evidence of at least one false negative at the article level changed over time. Figure 4 shows, per year, the proportion of articles with evidence for at least one false negative.

Proportion of papers reporting nonsignificant results in a given year, showing evidence for false negative results. Larger point size indicates a higher mean number of nonsignificant results reported in that year.

The result that 2 out of 3 papers containing nonsignificant results show evidence of at least one false negative empirically verifies previously voiced concerns about insufficient attention for false negatives (

The repeated concern about power and false negatives throughout the last decades seems not to have trickled down into substantial change in psychology research practice. Cohen (

Sample size development in psychology throughout 1985–2013, based on degrees of freedom across 258,050 test results. P25 = 25th percentile. P50 = 50th percentile (i.e., median). P75 = 75th percentile.

However, what has changed is the number of nonsignificant results reported in the literature. Our data show that more nonsignificant results are reported throughout the years (see Figure 2).

In order to illustrate the practical value of the Fisher test for testing the evidential value of (non)significant results, we applied it to gender effects reported in the psychology literature.

We planned to test for evidential value in six categories (expectation [3 levels] × significance [2 levels]). A result could be coded as ‘H_{1} expected’, ‘H_{0} expected’, or ‘no expectation’. Prior to data collection, we assessed the required sample size for the Fisher test based on research on the gender similarities hypothesis (

We sampled the 180 gender results from our database of over 250,000 test results in four steps. First, we automatically searched for “gender”, “sex”, “female” AND “male”, “man” AND “woman” [sic], or “men” AND “women” [sic] in the 100 characters before the statistical result and 100 after the statistical result (i.e., range of 200 characters surrounding the result), which yielded 27,523 results. Second, the first author inspected 500 characters before and after the first result of a randomly ordered list of all 27,523 results and coded whether it indeed pertained to gender. This was done until 180 results pertaining to gender were retrieved from 180 different articles. Third, these results were independently coded by all authors with respect to the expectations of the original researcher(s) (coding scheme available at
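The automated search in the first step can be sketched with a regular expression applied to a window around each extracted result. The window size (100 characters on each side) follows the text; the helper name and the simplified term list are our own, and co-occurrence pairs such as “man” AND “woman” are collapsed here into a single alternation rather than a strict pair check:

```python
import re

# Simplified gender vocabulary; \b anchors avoid matching e.g. "male"
# inside "female" or "men" inside "women".
GENDER_TERMS = re.compile(r"\b(gender|sex|females?|males?|wom[ae]n|m[ae]n)\b",
                          re.IGNORECASE)

def near_gender(text, start, end, window=100):
    """Check whether gender-related terms occur within `window` characters
    before or after a statistical result located at text[start:end]."""
    context = text[max(0, start - window):min(len(text), end + window)]
    return bool(GENDER_TERMS.search(context))
```

Results flagged by such a filter would still need the manual coding described in the second and third steps.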

Prior to analyzing these 178

The coding of the 178 results indicated that results rarely specify whether these are in line with the hypothesized effect (see Table 5). The conditions significant-H_{0} expected, nonsignificant-H_{0} expected, and nonsignificant-H_{1} expected contained too few results for meaningful investigation of evidential value (i.e., with sufficient statistical power).

Number of gender results coded per condition in a 2 (significance: significant or nonsignificant) by 3 (expectation: H_{0} expected, H_{1} expected, or no expectation) design. Cells printed in bold had sufficient results to inspect for evidential value.

| H_{0} expected | H_{1} expected | No expectation |
---|---|---|---|
Significant | 0 | | |
Nonsignificant | 2 | 1 | |

Figure 6 depicts the distributions of these p-values per condition. The Fisher tests indicated evidential value in each of the three conditions with sufficient results: χ²(22) = 358.904, χ²(15) = 1094.911, and χ²(174) = 324.374.

Probability density distributions of the p-values of gender effects, per condition.

We observed evidential value of gender effects both in the statistically significant (no expectation or H_{1} expected) and nonsignificant results (no expectation). The data from the 178 results we investigated indicated that in only 15 cases the expectation of the test result was clearly explicated. This indicates that based on test results alone, it is very difficult to differentiate between results that relate to a priori hypotheses and results that are of an exploratory nature. The importance of being able to differentiate between confirmatory and exploratory results has been previously demonstrated (

Out of the 100 replicated studies in the RPP, 64 did not yield a statistically significant effect size, despite the fact that high replication power was one of the aims of the project (

Of the 64 nonsignificant studies in the RPP data, 63 could be included in our analyses. Under the hypothesis H_{0} of no evidential value, the Fisher test statistic for these 63 results is χ²-distributed with 126 degrees of freedom.

Subsequently, we hypothesized that some number Y of these 63 studies examined true nonzero effects, and tested whether the observed Fisher χ²-value exceeds what would be expected under each such hypothesis. More specifically, we determined for which values of Y the observed Fisher statistic is no longer surprising, yielding a lower bound Y_{LB} and an upper bound Y_{UB} such that Y_{LB} ≤ Y ≤ Y_{UB}.

We computed the distribution of the Fisher test statistic for each hypothesized number of true effects Y by simulation, as follows.

Randomly selected which Y of the 63 effects were presumed to be true nonzero effects.

Given the degrees of freedom of the effects, we randomly generated p-values under H_{0} using the central distributions (for the 63 − Y presumed zero effects) and using non-central distributions (for the Y presumed nonzero effects).

The Fisher statistic was then computed for each simulated set of nonsignificant p-values.
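These generation and aggregation steps can be sketched as follows. The paper draws from the central and non-central distributions matching each study's own test statistic and degrees of freedom; this self-contained sketch substitutes z-tests, and the sample size n and effect size delta for the Y true effects are illustrative assumptions:

```python
import math
import random

def simulated_fisher(Y, k=63, n=50, delta=0.3, alpha=0.05, rng=None):
    """One draw of the Fisher statistic when Y of k nonsignificant results
    stem from true effects (z-test sketch; n and delta are assumptions)."""
    rng = rng or random.Random()
    chi2 = 0.0
    for i in range(k):
        ncp = delta * math.sqrt(n) if i < Y else 0.0  # non-central vs central
        while True:  # rejection-sample a *nonsignificant* p-value
            z = rng.gauss(ncp, 1.0)
            p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
            if p > alpha:
                break
        chi2 += -2.0 * math.log((p - alpha) / (1.0 - alpha))
    return chi2
```

Repeating this for many draws per value of Y gives the probability that the observed statistic is exceeded, from which bounds on Y can be read off.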

Probability of observing the Fisher test statistic, as a function of the hypothesized number of true effects Y.

Upon reanalysis of the 63 statistically nonsignificant replications within the RPP, we determined that many of these “failed” replications say hardly anything about whether there are truly no effects when using the adapted Fisher method. The Fisher test of these 63 nonsignificant results indicated some evidence for the presence of at least one false negative finding (χ²(126) = 155.2382, p = .039), but the resulting bounds Y_{LB} and Y_{UB} on the number of true effects Y were too wide to draw strong conclusions.

The reanalysis of the nonsignificant RPP results using the Fisher method demonstrates that any conclusions on the validity of individual effects based on “failed” replications, as determined by statistical significance, are unwarranted. This was also noted by both the original RPP team (

Very recently four statistical papers have re-analyzed the RPP results to either estimate the frequency of studies testing true zero hypotheses or to estimate the individual effects examined in the original and replication study. All four papers account for the possibility of publication bias in the original study. Johnson, Payne, Wang, Asher, and Mandal (

Much attention has been paid to false positive results in recent years. Our study demonstrates the importance of paying attention to false negatives alongside false positives. We examined evidence for false negatives in nonsignificant results in three different ways. Specifically, we adapted the Fisher method to detect the presence of at least one false negative in a set of statistically nonsignificant results. Simulations indicated the adapted Fisher test to be a powerful method for that purpose. The three applications indicated that (i) approximately two out of three psychology articles reporting nonsignificant results contain evidence for at least one false negative, (ii) nonsignificant results on gender effects contain evidence of true nonzero effects, and (iii) the statistically nonsignificant replications from the Reproducibility Project Psychology (RPP) do not warrant strong conclusions about the absence or presence of true zero effects underlying these nonsignificant results (RPP does yield less biased estimates of the effect; the original studies severely overestimated the effects of interest).

The methods used in the three different applications provide crucial context for interpreting the results. In applications 1 and 2, we did not differentiate between main and peripheral results. Hence, the interpretation of a significant Fisher test result pertains to the evidence of at least one false negative in all reported results, not the evidence for at least one false negative in the main results. Even when we focused only on the main results, as in application 3, the Fisher test does not indicate which specific result is a false negative; rather, it only provides evidence for a false negative in a set of results. As such, the Fisher test is primarily useful for testing a set of potentially underpowered results in a more powerful manner, although the conclusion then applies to the complete set. Additionally, in applications 1 and 2 we focused on results reported in eight psychology journals; extrapolating the results to other journals might not be warranted, given that there might be substantial differences in the type of results reported in other journals or fields.

More generally, our results in these three applications confirm that the problem of false negatives in psychology remains pervasive. Previous concern about power (

Reducing the emphasis on binary decisions in individual studies and increasing the emphasis on the precision of a study might help reduce the problem of decision errors (

For all three applications, the Fisher test's conclusions are limited to detecting at least one false negative in a set of results.

Another potential caveat relates to the data collected with the R package statcheck and used in applications 1 and 2. statcheck extracts inline, APA style reported test statistics, but does not include results reported in tables or results that are not reported as the APA prescribes. Consequently, our results and conclusions may not be generalizable to all results reported in these articles.

Given that the results indicate that false negatives are still a problem in psychology, albeit slowly on the decline in published research, further research is warranted. Further research could focus on comparing evidence for false negatives in main and peripheral results. Our results in combination with results of previous studies suggest that publication bias mainly operates on results of tests of main hypotheses, and less so on peripheral results. Another avenue for future research is using the Fisher test to re-examine evidence in the literature on certain other effects or often-used covariates, such as age and race, or to see if it helps researchers prevent dichotomous thinking with individual p-values.

Finally, the Fisher test can be, and has been, used to meta-analyze effect sizes of different studies. Whereas Fisher used his method to test the null hypothesis of an underlying true zero effect using several studies' p-values, extensions of the method can also be used to estimate the size of the underlying true effect.

To conclude, our three applications indicate that false negatives remain a problem in the psychology literature, despite the decreased attention, and that we should be wary of interpreting statistically nonsignificant results as showing there is no effect in reality. One way to combat this interpretation of statistically nonsignificant results is to incorporate testing for potential false negatives, which the Fisher method facilitates in a highly approachable manner (a spreadsheet for carrying out such a test is available at

The Fisher test to detect false negatives is only useful if it is powerful enough to detect evidence of at least one false negative result in papers with few nonsignificant results. Therefore we examined the specificity and sensitivity of the Fisher test with a simulation study of the one sample t-test, varying the effect size, the sample size, and the number of nonsignificant results (k), with conditions based on the observed dataset of Application 1. Each condition contained 10,000 simulations. The power of the Fisher test for one condition was calculated as the proportion of significant Fisher test results given α_{Fisher} = 0.10.

We simulated false negative p-values using the non-central t-distribution, with non-centrality parameter δ = √N × √(η² / (1 − η²)).
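As a small worked example of this conversion (the helper is our own and the numbers are illustrative):

```python
import math

def noncentrality(n, eta_squared):
    """Non-centrality parameter: delta = sqrt(N) * sqrt(eta2 / (1 - eta2))."""
    return math.sqrt(n) * math.sqrt(eta_squared / (1.0 - eta_squared))
```

For instance, η² = .0099 with N = 100 gives δ ≈ 1.0.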

Visual aid for simulating one nonsignificant test result. The critical value under H_{0} (left distribution) was used to determine the probability of a nonsignificant result under H_{1} (right distribution). A value between 0 and this probability was randomly drawn, and the corresponding t-value and its p-value under H_{0} determined.

We repeated the procedure to simulate a false negative p-value k times, yielding a set of k nonsignificant results to which the Fisher test was applied.
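Putting the pieces together, the power estimation can be sketched as follows. This sketch replaces the non-central t with a non-central z (a reasonable approximation for larger N) and uses far fewer than 10,000 iterations; the closed-form χ² survival function applies because the degrees of freedom 2k are even:

```python
import math
import random

def fisher_power(k, n, delta, alpha=0.05, alpha_fisher=0.10,
                 sims=1000, seed=3):
    """Estimate the power of the adapted Fisher test: the proportion of
    simulated papers (k nonsignificant z-test results each, true effect
    delta) in which the Fisher test is significant at alpha_fisher."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        chi2 = 0.0
        for _ in range(k):
            while True:  # rejection-sample one *nonsignificant* p-value
                z = rng.gauss(delta * math.sqrt(n), 1.0)
                p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
                if p > alpha:
                    break
            chi2 += -2.0 * math.log((p - alpha) / (1.0 - alpha))
        # closed-form survival function of a chi-square with 2k df
        half = chi2 / 2.0
        p_fisher = math.exp(-half) * sum(half**i / math.factorial(i)
                                         for i in range(k))
        if p_fisher < alpha_fisher:
            hits += 1
    return hits / sims
```

Under δ = 0 the estimate should approach α_{Fisher} = .10 (the meta-false positive rate); power rises with the effect size, the sample size, and k.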

We converted every reported test statistic to η², the explained variance for that test result, which ranges between 0 and 1, for comparing observed to expected effect size distributions. For F(df_{1}, df_{2}) results,

η² = (df_{1} × F) / (df_{1} × F + df_{2}),

where t-values were first squared (t² = F, with df_{1} = 1 for t-tests), which shows that when df_{1} = 1 this reduces to η² = t² / (t² + df_{2}). For correlation coefficients, η² = r². Analogous conversions to the explained-variance scale were applied to the remaining test statistics.
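The conversions from test statistics to η² take only a few lines; the helper names below are our own, and the t and r rules follow from t² = F (with df_{1} = 1) and η² = r²:

```python
def eta2_from_F(F, df1, df2):
    """Explained variance from an F(df1, df2) statistic."""
    return (df1 * F) / (df1 * F + df2)

def eta2_from_t(t, df):
    """t-values are squared first: t^2 = F with df1 = 1."""
    return eta2_from_F(t * t, 1, df)

def eta2_from_r(r):
    """For correlations the explained variance is simply r^2."""
    return r * r
```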

JMW received funding from the Dutch Science Funding (NWO; 016-125-385) and all authors are (partially-)funded by the Office of Research Integrity (ORI; ORIIR160019).

All research files, data, and analysis scripts are preserved and made available for download at