But I don't think that is a fair criticism.
To clarify, my point was not meant as a criticism of Bem himself or of his paper. Bem was following the analytical standards of experimental psychology. Unfortunately, those standards allow for, indeed encourage, researcher degrees of freedom, which invalidate most published research in the field. Francis (2013a) found excess success in 82% of multi-experiment papers in Psychological Science, and Francis et al. (2014) found excess success in 83% of multi-experiment psychology papers in Science. What Francis, I, and others are criticizing are the accepted standards themselves, not the practitioners who follow them.
If there were a predetermined hypothesis, then the p-value for that hypothesis would be valid, whatever additional exploratory comparisons were made using the same data.
That's true: if there were a specific predetermined hypothesis that admitted a single statistical test, then the p-value for that test would be valid, but only for that test. The problem is that each experiment is testing an overarching, more general hypothesis for which many statistical tests could be conducted. If the investigator would conduct these other tests whenever the "main" test came out non-significant, and would claim support for the overarching hypothesis if any one of them were significant, then none of these p-values (not even the "main" one) would be a valid p-value for the experiment. This holds even if each p-value, taken individually, is valid for its own test.
To put it another way, a valid p-value for the whole experiment would have to account for how many analysis options the investigators had, and which options they would take, depending on the outcome of their "main" test. This calculation is likely impossible to perform in practice, because I doubt that experimenters themselves know ahead of time what they would do. This is why predetermined analysis protocols, which I suspect have rarely been employed in experimental psychology, are important.
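The inflation produced by this fall-back behavior can be made concrete with a small simulation. This is my own sketch, not anything from the discussion above; it assumes the simplest possible setup of two independent tests whose p-values are uniform under the null:

```python
import random

random.seed(1)
ALPHA = 0.05
N_SIM = 200_000

false_positives = 0
for _ in range(N_SIM):
    # Under the null hypothesis, each test's p-value is uniform on [0, 1].
    p_main = random.random()
    p_backup = random.random()  # e.g. a secondary or subgroup comparison

    # The investigator's actual procedure: report the "main" test if it is
    # significant; otherwise fall back to the secondary test and claim
    # support for the overarching hypothesis if *that* is significant.
    if p_main < ALPHA or p_backup < ALPHA:
        false_positives += 1

rate = false_positives / N_SIM
print(f"Nominal alpha per test:              {ALPHA}")
print(f"Actual Type I error of the procedure: {rate:.4f}")  # approx 0.0975
```

The procedure's long-run error rate is roughly double the nominal 5%, even though on any single run only one p-value may ever be reported. That is exactly the sense in which the reported "main" p-value is no longer a valid p-value for the experiment.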
And every experiment conducted using both male and female subjects has the potential for an additional exploratory male-female comparison.
I agree that exploratory research is important, but investigators need to clearly separate their confirmatory hypotheses from their exploratory ones. If researchers would substitute an exploratory or secondary hypothesis for their main hypothesis whenever the main test came out non-significant, then they inflate the Type I error rate for the experiment, and no p-value from any of the tests is a valid p-value for the experiment. It is really just a multiple-comparison problem. What is subtle is that the problem exists even when the main hypothesis test is significant and none of the secondary tests are actually performed. The Type I error rate is a long-run probability; hence a valid p-value for the experiment must take into account what would have been done had the main test been non-significant, even when the main test is significant.
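For independent tests, the long-run error rate of this any-one-significant procedure has a simple closed form, and a standard Bonferroni adjustment shows what a valid experiment-wide p-value could look like. This is my own illustration, not a method proposed in the thread, and it sidesteps the hard part noted above: in practice the number of contingent tests k is unknown, possibly even to the investigators themselves.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Probability that at least one of k independent tests is significant
    at level alpha when every null hypothesis is true: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

def bonferroni_p(p_min: float, k: int) -> float:
    """A conservative experiment-wide p-value: the smallest observed p-value
    multiplied by the number of tests the investigator would have been
    willing to run (capped at 1)."""
    return min(1.0, p_min * k)

for k in (1, 2, 3, 5):
    print(f"k={k}: family-wise error at alpha=0.05 is {familywise_error(0.05, k):.4f}")

# A nominally significant p = 0.03 stops being significant experiment-wide
# once even two contingent alternatives existed: 0.03 * 3 is approx 0.09.
print(bonferroni_p(0.03, 3))
```

The Bonferroni factor depends on k, the number of tests that would have been run, not the number actually run, which is why a valid experiment-wide p-value must account for contingencies that never materialized.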