Interesting article on nature.com about P values, "Scientific method: Statistical errors". The subtitle sums it up: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume.
A couple of highlights:
P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor’s new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny3. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”3, presumably for the acronym it would yield.
The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a ‘null hypothesis’ that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil’s advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible.
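The calculation described above (assume the null hypothesis is true, then ask how often chance alone gives a result at least as extreme as the one observed) can be sketched as a permutation test. This is only a minimal illustration, not code from the article. The function name `permutation_p_value`, the example numbers, and the choice of a difference in group means as the test statistic are all assumptions made for the sketch.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Approximate two-sided P value for a difference in group means.

    Hypothetical helper for illustration. The null hypothesis is that the
    group labels are irrelevant, so shuffling them should produce differences
    as large as the observed one reasonably often.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Play devil's advocate: reassign the labels at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point rounding does not hide exact ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction: the observed labelling counts as one permutation,
    # so the estimate can never be exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    treated = [5.1, 4.8, 6.0, 5.5, 5.9]  # made-up numbers
    control = [4.2, 4.0, 4.9, 4.4, 4.1]  # made-up numbers
    print(permutation_p_value(treated, control))
```

Note what the number is: the chance of data this extreme if the null hypothesis holds. It is not the chance that the null hypothesis is true, which is why a small value is better read as "worthy of a second look" than as a verdict.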
More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: ‘What is the evidence?’ ‘What should I believe?’ and ‘What should I do?’ One method cannot answer all these questions, Goodman says: “The numbers are where the scientific discussion should start, not end.”
Always be skeptical. One research project is not the definitive answer; rather, we need to triangulate on the truth. One suggestion from the article that I thought was a great idea:
Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers ‘P-certified, not P-hacked’ by including the words: “We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study.” This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.
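To see why disclosing "all measures" matters, here is a rough simulation of one form of P-hacking: measure several outcomes when nothing real is going on, then report only the smallest P value. It is a sketch under stated assumptions, not an analysis from the article. The function name `false_positive_rate`, the sample sizes, and the use of a simple two-sided z-test with known standard deviation are all choices made for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_measures, n_studies=2000, n_samples=20,
                        alpha=0.05, seed=0):
    """Share of null studies that report at least one P value below alpha.

    Hypothetical helper for illustration. Every measure is pure noise
    (mean 0, standard deviation 1), so any "significant" result is a
    false positive.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_measures):
            sample = [rng.gauss(0, 1) for _ in range(n_samples)]
            # z statistic for the sample mean, with known sd = 1.
            z = (sum(sample) / n_samples) * n_samples ** 0.5
            p_values.append(2 * (1 - std_normal.cdf(abs(z))))
        # The "hack": keep only the best-looking measure.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, "measures:", false_positive_rate(k))
```

With one measure the rate should sit near the nominal 5%. With k independent measures it climbs toward 1 - 0.95^k, roughly 23% for five. A reader who is told how many measures were taken can make that adjustment; a reader who is not told cannot.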