I think Chris is making a point here. I have no idea how we can calculate the probability, but we cannot completely dismiss the possibility of fraud.

Come on, Jay. Be serious. Just how many of the "thousands of possible analyses" do you think Bem would have needed to have tried to have finally chosen, ever so "subtly" *based on his actual data*, the most successful one(s), and after how many trials of possible analyses *based on his actual data* would it have become utterly impossible to have even pretended to *himself* to be doing anything other than blatantly, deliberately p-hacking?

Hold on. Who, except Chris, has suggested that Bem did "thousands of possible analyses"?

No, I did not say that Bem conducted thousands of tests; I said that he made choices among thousands of possible analyses.

First something basic:

Suppose there are 20 experiments testing whether men or women are better at something. We assume that there is no difference in reality and that we get results that are completely representative of the statistical expectation.

Now everyone tests if the women scored better than the men. We get an average of 1 in 20 statistically significant results, so here we say we find 1.

But now everyone does a second test: Whether the men scored better than the women and, of course, we get another significant result.

Now we have 2 significant results. If we ignore the problem of multiple testing, then we would think that there are 19+19=38 unreported results in the file-drawer if there is no effect.

However, we know that this is only because of the multiple testing and that in reality there are only 18 unreported results.

Now suppose that everyone is honest and reports everything.

The first significant result gets reported as before. Since that experimenter never needed a second test, there is nothing more to report.

Meanwhile, the second significant result also gets reported honestly: two tests were done, so after correcting for both, that result is only "marginally significant at the 10% level".

This would lead you to suppose that the file-drawer should contain 19+9=28 studies if there is no effect.

But wait... We know the file-drawer contains only 18 studies. What went wrong here?

The answer is that you need to correct for every test that **could** have been done and not just those that were actually done.
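
To make the arithmetic above concrete, here is a minimal sketch in Python. The function names are made up for illustration, and the model is the simple one from the example: every significant result at level alpha is taken to stand for 1/alpha studies, of which all but one sit in the file-drawer.

```python
def implied_file_drawer(alphas):
    """Unreported null studies implied by a list of significant results,
    if each is naively read as the outcome of a single test at its level."""
    return sum(round(1 / a) - 1 for a in alphas)


def drawer_counting_possible_tests(n_significant, alpha, tests_possible):
    """Same idea, but correcting for every test that could have been done.

    With two mutually exclusive one-sided tests (women > men, men > women),
    the chance that an experiment yields *some* significant result is
    exactly alpha * tests_possible.
    """
    hit_rate = alpha * tests_possible
    implied_experiments = round(n_significant / hit_rate)
    return implied_experiments - n_significant


print(implied_file_drawer([0.05, 0.05]))           # 38: multiple testing ignored
print(implied_file_drawer([0.05, 0.10]))           # 28: corrected only for tests actually done
print(drawer_counting_possible_tests(2, 0.05, 2))  # 18: corrected for tests that could be done
```

Only the last figure matches the 18 studies that are really in the drawer.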

Let's get back to Bem.

You point to the 1 in 500 value. But that itself is one p-value from 9 experiments. That means that if 500/9≈56 independent tests were done per experiment, then you would expect one such value. Suppose that Bem had a bit of a file-drawer and only reported 9 of 20 experiments; then you'd only need 500/20=25 independent tests per experiment to expect one such value.

This 1 in 500 figure melted down fast, didn't it?

**The question then becomes whether he had the opportunity to do about 50 independent tests per experiment**.

Mind that 50 is not an estimate. If he had done 50 independent tests per experiment, the p-values should look quite different.

Next you need to know that a scientific hypothesis is not the same as a statistical hypothesis. A scientific hypothesis needs to be translated into a statistical model for testing. There will almost certainly be numerous statistical models that correspond to one single scientific hypothesis. That means you can do several statistical tests for every hypothesis you test.

For example, Bem's choice of combining experiments: That is equivalent to doing multiple tests in that it will give a misleading p-value.

Next, you don't need to actually run a statistical test to do a statistical test. By which I mean that you don't need to fire up your stats software. You can just notice some pattern. Any idiot can detect a pretty pattern in the clouds. That corresponds to doing multiple statistical tests. Any scientist will know the data he is gathering, and it will be hard for him not to choose a test corresponding to some pattern he has noticed. That's what Jay calls "data-driven choices".

Remember that Bem took years doing these experiments. It's not like he would have sat down for a week with the data and done some hardcore number crunching.

I don't see any point in speculating about Bem's state of mind or intentions but I think it's credible that he was unaware of his bad practices.

No, I did not say that Bem conducted thousands of tests; I said that he made choices among thousands of possible analyses. I think those choices were subtly data driven, and I think that Bem could convince himself that the choices he ultimately made were the choices that he would have made all along.

Key quote (emphasis mine):

When you are through exploring, you may conclude that the data are not strong enough to justify your new insights formally, but at least you are now ready to design the "right" study. If you still plan to report the current data, you **may** wish to mention the new insights tentatively, stating honestly that they remain to be tested adequately. Alternatively, the data may be strong enough to justify **recentering your article around the new findings and subordinating or even ignoring your original hypotheses.**

In other words, he clearly distinguishes between exploratory and confirmatory techniques, and the advice he gave which Jay finds so questionable was applicable to the former, not the latter. He is *not* suggesting that in the context of a confirmatory study, one could retrofit one's hypothesis.

Come on, Jay. Be serious. Just how many of the "thousands of possible analyses" do you think Bem would have needed to have tried to have finally chosen, ever so "subtly" *based on his actual data*, the most successful one(s)...

I guess you and Chris don't really get it.

Ludicrous. No one in their right mind could honestly think that Bem believes it would be acceptable to search through thousands of hypotheses in order to find a significant result, and then pretend that the experiment had been designed to test that hypothesis.

At this point I have to agree with Diatom: no one but you and Laird believes that if Bem engaged in p-hacking, he committed fraud. That's your accusation (albeit a contingent one). In two systematic analyses, over 80% of multi-experiment experimental psychology papers were found to exhibit evidence of p-hacking. The practice is endemic in the field (and probably in other fields as well). No one is yelling "fraud."

I suspect that what gets lost, because there were thousands of possible analyses, is that it doesn't take thousands of analyses to come up with significant findings.

Simmons et al. showed that small amounts of flexibility have a large effect on the production of significant findings: you only need one extra outcome variable, a plan to perform an additional experimental series (if necessary), one modifier and three conditions (added or dropped, as necessary) in order to boost your production of false positives to over 80%.

http://www.researchgate.net/profile...ignificant/links/09e4150f5ccd74c12e000000.pdf

Bem exceeds that minimal degree of flexibility. In precognitive habituation, Bem had three outcomes (mere exposure, habituation and boredom), six conditions, more than five additional experimental series, and at least five modifiers. And nobody batted an eyelash over the multiple analyses that the production of his significant findings would entail, in this case. And this is only considering those analyses he made explicit. We don't even need to consider that he performed any additional analyses, but elected not to mention their results.

You may get more traction if, instead of focusing on just how many potential analyses were available, you focused on how few potential analyses it takes to produce a considerable excess of positive results.
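
A toy simulation makes the "how few" point visible. This is only a sketch under stated assumptions, not the procedure Simmons et al. used: it assumes no real effect, 20 subjects per group, independent outcome measures, and a simple z-test with known unit variance, and it counts an experiment as "positive" if any one of the available analyses comes out at p < .05. All names and parameters here are invented for the illustration.

```python
import math
import random


def two_sided_p(group_a, group_b):
    """Two-sided z-test p-value for a difference in means.

    Assumes equal group sizes and a known variance of 1 in both groups.
    """
    n = len(group_a)
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_outcomes, n_sims=4000, n_per_group=20,
                        alpha=0.05, seed=1):
    """Share of null experiments where at least one outcome is 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        if min(p_values) < alpha:  # report whichever analysis "worked"
            hits += 1
    return hits / n_sims


for k in (1, 2, 5):
    print(k, "analyses available:", false_positive_rate(k))
```

With independent analyses the rate should come out near 1 - 0.95^k, so even a handful of available analyses pushes the false-positive rate far above the nominal 5%. Correlated outcomes, optional stopping and dropped conditions would change the exact figures, which is why this should not be read as reproducing the paper's numbers.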

It's interesting to see the second part (where everyone convinces themselves that the 'significant findings' represent the obvious hypothesis all along) play out. The purported psychological effect which Bem intends to test, "mere exposure", would be hypothesized to produce an increased hit rate across the board. This was contradicted by the experimental results. So then we have "habituation" which explains the findings which did not support the "mere exposure" hypothesis. And then when the "habituation" results aren't supported, we have "aversion" and "boredom". If the hypothesis was so obvious, why did Bem explicitly test numerous other hypotheses along the way? Would he really have regarded an increased hit rate in all groups as a failure of his hypothesis?

Linda

Hold on. Who, except Chris, has suggested that Bem did "thousands of possible analyses"?

Chris hasn't suggested that, as far as I'm aware. I was quoting Jay directly, from the post to which I was responding:

Oh, all right. Should have read more carefully.
