Was Bem's "Feeling the Future" paper exploratory?

But the experiment itself can't test hypotheses, can it?

Well, yes, it can. That is, if you have collected data on the responses of people to positive/low-arousal pictures, in addition to the responses to negative/high-arousal pictures, then the experiment can test a positive/low-arousal hypothesis, a negative/high-arousal hypothesis, and a general hypothesis. If you intend to test only a negative/high-arousal hypothesis, then you won't collect data on positive/high-arousal responses, and a test of a positive/high-arousal hypothesis won't be an option.

That's up to the experimenter. So we very much can't ignore what the experimenter says about what hypothesis the experiment was intended to test.

I suppose it's up to the experimenter which results they want to select out of the pool of data. But the point is that once you have a pool, it doesn't matter what your intentions were.

Linda
 
But the point is that once you have a pool, it doesn't matter what your intentions were.
So does the pool of data that Bem collected show a positive effect of precognition for erotic stimuli?

Cheers,
Bill
 
Linda

You seem to be saying that if an experimenter sets out to test a hypothesis experimentally, then the experimental test is invalidated if the resulting data can be used to test other hypotheses, even if the experimenter has no intention of testing other hypotheses and doesn't do so.

Are you perhaps attempting a reductio ad absurdum of the criticisms of Bem's paper?
 
Linda

You seem to be saying that if an experimenter sets out to test a hypothesis experimentally, then the experimental test is invalidated if the resulting data can be used to test other hypotheses, even if the experimenter has no intention of testing other hypotheses and doesn't do so.

What do you mean by "invalidated"?

Why would an experimenter collect data which could be used to test other hypotheses, but is irrelevant to the hypothesis they intend to test?

Are you perhaps attempting a reductio ad absurdum of the criticisms of Bem's paper?

No, but it's interesting that you make the suggestion. You said you didn't understand the criticisms from Gelman and Loken.

Linda
 
Linda

Actually, what I said about those criticisms was that they didn't make sense to me. [Note to Super Sexy: That isn't a link.]

Your interpretation makes even less sense to me. By "invalidated", I mean that the statistical test of the hypothesis is made invalid. If that's not what you are saying, all well and good.
 
Linda

Actually, what I said about those criticisms was that they didn't make sense to me. [Note to Super Sexy: That isn't a link.]

Your interpretation makes even less sense to me. By "invalidated", I mean that the statistical test of the hypothesis is made invalid. If that's not what you are saying, all well and good.
I was thinking that 'don't understand' and 'doesn't make sense' are much the same thing.

I like how the authors put it - that the claims aren't necessarily wrong, but rather more uncertain and fragile than the p-values would suggest.

Linda
 
I was thinking that 'don't understand' and 'doesn't make sense' are much the same thing.

The first implies the second, but not vice versa. Think about someone saying "Two plus two equals five."
 
The issue isn't whether the label "exploratory" is valid, but rather whether the reported p-values are. If the scientific hypothesis is not sufficiently specific, then it admits multiple statistical hypotheses, each of which, if "significant," could be claimed as a success for the scientific hypothesis.

For example, consider Bem's Experiment 1. Bem writes, "[T]he main psi hypothesis was that participants would be able to identify the position of the hidden erotic pictures significantly more often than chance (50%)," but that "the hit rate on erotic trials can also be compared with the hit rates on the nonerotic trials...," and indeed he reported tests of both hypotheses. Both turned out to be statistically significant, but it could have turned out otherwise. Suppose only the "main" hypothesis was confirmed; then the experiment would have been a success. But suppose only the (supposedly) secondary hypothesis was confirmed. Then that would have been a success for the experiment, too. Thus Bem had at least two ways to claim success for Experiment 1, which renders his p-values uninterpretable.
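The inflation from having two ways to claim success can be illustrated with a quick simulation. This is only a sketch: the sample size, number of trials per subject, and choice of t-tests below are assumptions for illustration, not Bem's actual design or analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_subj, n_trials = 10_000, 100, 12  # assumed sizes, not Bem's design
wins = 0
for _ in range(n_sims):
    # Under the null, hit rates on erotic and nonerotic trials are both 50%.
    erotic = rng.binomial(n_trials, 0.5, n_subj) / n_trials
    nonerotic = rng.binomial(n_trials, 0.5, n_subj) / n_trials
    # Way to win #1: erotic hit rate differs from chance (50%).
    p1 = stats.ttest_1samp(erotic, 0.5).pvalue
    # Way to win #2: erotic hit rate differs from nonerotic hit rate.
    p2 = stats.ttest_rel(erotic, nonerotic).pvalue
    wins += (p1 < 0.05) or (p2 < 0.05)
rate = wins / n_sims
print(rate)  # noticeably above the nominal 0.05
```

Even though each individual test has a 5% false-positive rate, the chance that at least one of the two comes up "significant" under the null is well above 5%, which is exactly why a declared success from either test can't be read at face value.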

But Bem had even more ways to win in Experiment 1. Bem writes, "In our first retroactive experiment [Experiment 5, oddly enough], women showed psi effects to highly arousing stimuli but men did not. Because this appeared to have arisen from men's lower arousal to such stimuli [wait, what?], we introduced...stronger and more explicit images...for the men [in Experiment 1]." Bem's "main" hypothesis was statistically significant, but it didn't have to be. What if, instead, the result had been significant for women but not for men? Then Experiment 1 would be a successful replication of Experiment 5. Or what if the result had been significant for men but not for women? Then Bem could claim that the stronger stimuli he introduced for men worked and come up with some hypothesis about why the results for women were non-significant (perhaps they were in the hypothesized direction, but inconveniently "failed to attain" significance).

When an investigator has many statistical analyses he could perform to declare success, the reliability of a declared success is reduced in the absence of a preregistered protocol stating exactly what tests were to be performed. Reading Bem's paper and the critiques of Wagenmakers, Alcock, and others, one gets the impression that Bem had more degrees of freedom than a naked hippie at Woodstock. Furthermore, Bem's paper has too many successful hypothesis tests, given the power of his experiments and the effect sizes he reported. Francis (2012) showed that the probability of obtaining at least as many successful outcomes as Bem reported is 5.8%, using a test known to have a large upward bias (Francis 2013). The results of this test can be interpreted as the probability of an exact replication attempt achieving at least as many successful outcomes as the original set of experiments had. Given the tendency of the test to substantially overestimate this probability, the true probability of replication is probably pretty negligible. Indeed several replication attempts of just one of Bem's experiments—never mind all ten of them—have failed. In order for Bem to have gotten as many successful outcomes as he did, he almost certainly had to have employed questionable research practices such as selective reporting.
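The logic behind Francis's test can be made concrete: given an estimate of each experiment's power, the number of successes across independent experiments follows a Poisson-binomial distribution, and its upper tail gives the probability of seeing at least as many successes as were reported. A minimal sketch, where the power values are hypothetical placeholders and not Francis's actual estimates from Bem's effect sizes:

```python
def prob_at_least(powers, k):
    """P(at least k experiments succeed), where experiment i succeeds
    independently with probability powers[i] (a Poisson-binomial tail,
    computed by dynamic programming)."""
    dist = [1.0]  # dist[j] = P(exactly j successes among experiments so far)
    for p in powers:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)  # this experiment fails
            new[j + 1] += q * p      # this experiment succeeds
        dist = new
    return sum(dist[k:])

# Hypothetical per-experiment power values -- illustrative only.
powers = [0.7] * 10
print(prob_at_least(powers, 9))  # chance of 9+ successes in 10 experiments
</n>```

Even with a fairly generous 70% power per experiment, a near-perfect run of successes is improbable, which is the sense in which "too many" significant results is itself evidence of selective reporting.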

References:
Francis (2012): http://drsmorey.org/bibtex/upload/Francis:2012d.pdf
Francis (2013): http://www2.psych.purdue.edu/~gfrancis/Publications/Francis2013b.pdf
 
The issue isn't whether the label "exploratory" is valid, but rather whether the reported p-values are. If the scientific hypothesis is not sufficiently specific, then it admits multiple statistical hypotheses, each of which, if "significant," could be claimed as a success for the scientific hypothesis.

This is a good point which I think gets lost when using Bem as an example, because his experiments were intentionally non-specific. So then the argument naturally turns to his intentions, with one side credulous and the other not. But the issues raised by Francis, Gelman, Wagenmakers, and others apply even if the "not sufficiently specific" is unintentional or unavoidable. The problem isn't solved by asking whether the experimenter's hypothesis was sufficiently specific. It is solved by asking whether the experiment is sufficiently specific.

Reading Bem's paper and the critiques of Wagenmakers, Alcock, and others, one gets the impression that Bem had more degrees of freedom than a naked hippie at Woodstock.

How long have you been waiting to use that line? :D

Linda
 
The issue isn't whether the label "exploratory" is valid, but rather whether the reported p-values are.

Yes, I agree. This is really what I was getting at in the first post when I said the suggestion was that the hypotheses/experimental procedure had been modified "so that the statistical significance of the results can't be taken at face value". That was the sense of "exploratory" I was wanting to discuss.

For example, consider Bem's Experiment 1. Bem writes, "[T]he main psi hypothesis was that participants would be able to identify the position of the hidden erotic pictures significantly more often than chance (50%)," but that "the hit rate on erotic trials can also be compared with the hit rates on the nonerotic trials...," and indeed he reported tests of both hypotheses.

And in fact he doesn't actually state a hypothesis for those additional comparisons, so in a sense we are left guessing what the other psi hypotheses might be, in addition to the "main" one.

But Bem had even more ways to win in Experiment 1. Bem writes, "In our first retroactive experiment [Experiment 5, oddly enough], women showed psi effects to highly arousing stimuli but men did not. Because this appeared to have arisen from men's lower arousal to such stimuli [wait, what?], we introduced...stronger and more explicit images...for the men [in Experiment 1]." Bem's "main" hypothesis was statistically significant, but it didn't have to be. What if, instead, the result had been significant for women but not for men? Then Experiment 1 would be a successful replication of Experiment 5. Or what if the result had been significant for men but not for women? Then Bem could claim that the stronger stimuli he introduced for men worked and come up with some hypothesis about why the results for women were non-significant (perhaps they were in the hypothesized direction, but inconveniently "failed to attain" significance).

But I don't think that is a fair criticism. If there were a predetermined hypothesis, then the p-value for that hypothesis would be valid, whatever additional exploratory comparisons were made using the same data. And every experiment conducted using both male and female subjects has the potential for an additional exploratory male-female comparison. That doesn't mean that the p-values obtained in all such experiments are invalid.
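That point can be checked directly by simulation: under the null, a prespecified primary test rejects at its nominal rate no matter how many exploratory subgroup comparisons are run on the same data. The sample sizes and the particular tests below are arbitrary assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 10_000, 50
primary_rejections = 0
for _ in range(n_sims):
    # Under the null, neither sex shows any effect.
    men = rng.normal(0.0, 1.0, n_per_group)
    women = rng.normal(0.0, 1.0, n_per_group)
    # Prespecified primary test: pooled sample against zero.
    p_primary = stats.ttest_1samp(np.concatenate([men, women]), 0.0).pvalue
    # Exploratory extras; running them leaves p_primary's distribution untouched.
    stats.ttest_1samp(men, 0.0)
    stats.ttest_1samp(women, 0.0)
    stats.ttest_ind(men, women)
    primary_rejections += p_primary < 0.05
rate = primary_rejections / n_sims
print(rate)  # close to the nominal 0.05
```

The false-positive rate of the primary test stays near 5%. The p-value only becomes uninterpretable when a "significant" result from any of the extra comparisons would also have been counted as a success, not when the comparisons are merely run.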
 