Science is arguably the best tool humans have devised to objectively explore the hows and whys of the world around us. One of the hallmark characteristics of the scientific process is that it is reproducible, meaning that the existence of any phenomenon should not depend on who is looking at it. Modern science is facing what is being called a 'reproducibility crisis,' however, as researchers recognize that many key results cannot be duplicated. This growing problem is casting doubt on the reliability of scientific findings.
The results of the largest reproducibility study to date are in, and they are shocking. Brian Nosek, a social psychologist and the head of the Center for Open Science, collaborated with over 250 other researchers to try to reproduce 100 psychological experiments from three top scientific journals. They found that only 39 percent of the experiments they examined could be replicated. In fairness to psychology, this figure rests on the researchers' subjective judgment of whether each result counted as a successful replication.
Even when statistical techniques are applied instead of human judgment, the implications of these results are bleak. The most commonly relied upon statistical parameter in modern biology and psychology is the p-value, which essentially measures the probability that a result would occur by chance. So, the smaller the p-value, the smaller the probability that an outcome occurred at random and the more likely it is that what you're observing is a real effect. By convention, researchers have set a threshold for the p-value at 0.05: any result with a p-value smaller than that threshold is called 'significant,' and anything above it is not.
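To make the idea concrete, here is a small sketch of one common way to compute a p-value, a permutation test, using invented scores for two hypothetical groups. It asks: if group membership didn't matter, how often would random relabeling produce a difference as large as the one we saw?

```python
import random
import statistics

random.seed(0)

# Hypothetical data, invented for illustration: scores for two groups.
group_a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.7, 5.2, 5.5]
group_b = [4.6, 4.4, 5.0, 4.7, 4.3, 4.9, 4.5, 4.8]
observed = statistics.mean(group_a) - statistics.mean(group_b)

# Permutation test: pool the data, shuffle it many times, and count how
# often a random split produces a difference at least as large as the
# observed one. That fraction estimates the p-value.
pooled = group_a + group_b
n_iter = 10_000
extreme = 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_iter
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

With these made-up numbers the two groups barely overlap, so random shuffles almost never reproduce the observed gap and the p-value falls well below the 0.05 threshold.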
Almost all of the original experiments that Nosek’s team looked at had significant results, but upon replication, they found that only 35 percent had p-values lower than the significance threshold. Even though this finding sounds statistically conclusive, Nosek doesn’t give it much weight, mostly because many scientists reject the idea that the difference between a true and false result is a single bright-line rule like the significance test; the real world is not so cut and dried.
Another common statistical metric used to evaluate data is called the effect size. As the name suggests, it is a measure of how strong an effect is. There are many ways to measure the effect size, and a very common method uses the correlation coefficient to tell how closely two variables are related. For example, doing one's homework is strongly correlated with getting good grades, while the innie/outie status of one's belly button probably isn't correlated with academic performance at all.
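As a sketch of how that correlation coefficient works, here is the standard Pearson formula, covariance divided by the product of the standard deviations, applied to invented homework-versus-grades data:

```python
import statistics

# Hypothetical data, invented for illustration: weekly homework hours
# versus final grade (percent) for eight students.
homework_hours = [1, 2, 3, 4, 5, 6, 7, 8]
grades = [62, 68, 70, 75, 78, 83, 85, 90]

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance of the two variables
    divided by the product of their standard deviations; ranges -1 to 1."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(homework_hours, grades)
print(f"correlation between homework and grades: r = {r:.3f}")
```

For these invented numbers r comes out near 1, a large effect; for a variable with no relationship to grades, like belly-button status, r would hover near 0.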
When comparing the correlation coefficient of original and replicated data, Nosek found that in 82 percent of experiments, the original effect size was larger than the reproduced, striking another blow against the reliability of these findings. Nosek claims his findings are not meant to discredit the scientific process but to draw attention to science behaving as it should—endlessly verifying what we 'know' to be true.
I have been picking on psychology simply because the most comprehensive reproducibility project was undertaken in this field, but irreproducibility is a problem in all sciences. A similar project is underway in the field of cancer biology, as many biologists bemoan the lack of reproducibility of many experiments. Some have blamed this on the fickleness of antibodies, the protein workhorses of molecular medical research, which some researchers claim may not bind as specifically as expected. In physics, someone occasionally publishes a result claiming to have measured a particle traveling faster than light, which Einstein's theory of relativity forbids. Such findings cannot be reproduced, and so they are debunked.
After realizing that irreproducibility is a problem, we must ask what causes it. Nosek and others suspect the root of the problem is twofold: a combination of faulty statistical procedure and powerful incentives for scientists to publish low-quality work. It all starts with those pesky p-values. It's not really the p-value's fault—we ask more of it than it can provide. Strictly, the p-value tells you how likely your data would be if chance alone were at work, but what we often ask it to tell us is whether our hypothesis is true. P-values can tell us whether results are significant, not whether the effect we are trying to measure is real.
The other side of the problem is that fierce competition in academia incentivizes publication of novel results over reproduction of previous findings. These new results may be totally wrong, but as long as they are significant, they are publishable. This combination has flooded the literature with false positives and bad research. Fortunately, scientists are catching on that reproducing old findings is important, and improvements are being made to the publication process. But for now, remain cautious and don’t believe everything you read.