To follow up on my previous post: since the occurrence of an event having low probability according to some model is not always reason to doubt the model, it becomes natural to ask what would be reason to doubt the model. And for this, some of the negative examples from last time may help to bring things into focus.
In the case of a continuously distributed variable (one whose distribution is absolutely continuous, and so has a probability density) the probability of any particular value is zero, so whatever we observe is a case of something that had low (indeed zero) probability in the model. (And the same applies in the case of a discrete variable with a large number of equally likely values.)
When we throw an icosahedral die, whatever face we see on top has a probability of only 1/20 of being there, but we don’t take that as evidence that the die is not a fair one. However, if someone specific had correctly predicted the result then we might be more suspicious – and that is the key to how p-values work. (By “someone specific” here I mean specified in advance – not the same as having 20 people place bets and being surprised to see one of them get it right.)
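Here is the back-of-the-envelope arithmetic behind that last parenthetical, as a minimal Python sketch (assuming the 20 bettors each pick a face independently and at random):

```python
# Chance that at least one of 20 independent bettors, each picking a face
# of a fair icosahedral (20-sided) die at random, correctly predicts the outcome.
p_single = 1 / 20                      # probability any one pre-specified prediction is right
n_bettors = 20
p_at_least_one = 1 - (1 - p_single) ** n_bettors
print(f"single pre-specified prediction correct: {p_single:.3f}")   # 0.050
print(f"at least one of 20 bettors correct:      {p_at_least_one:.3f}")  # about 0.642
```

So seeing one of twenty bettors succeed is better than an even bet, and nothing to be suspicious about.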
Similarly, in my silly example from last time, although a value very close to the predicted mean should not cause us to doubt that the predicted mean is correct, it may well cause us to doubt that the variance is as large as the model proposes. (And in fact there are several historical examples where data clustered too closely around the predicted mean have been taken as convincing evidence of experimental malfeasance.)
So in order to be made suspicious by a low p-value it seems to be important that we know in advance what statistic we will be interested in and what kind of results we will consider significant.
This does not answer the question of exactly what that significance means or how we quantify it, but I think it does suggest that there is a valid intuition behind the idea that seeing something actually occur right after asking whether it will happen makes us doubt the claim that it actually had very low probability.
Now when I buy a lottery ticket I do wonder if I will actually win. So if I do win the jackpot I will be faced with something that to me would be a significant indication that there was more than chance involved. Of course in that case I will probably be wrong, but the probability of my being forced into that error is so small as to be of little worry to me.
Similarly, if I reject a null hypothesis model on seeing a pre-specified outcome to which the model assigns a probability of p (whether it’s an extreme value of some statistic or just a value very close to some specific target), then if the hypothesis is actually true I have a probability p of being wrong.
That’s what the p-value really is. It’s the probability that the model predicts for whatever outcome we choose to specify in advance of checking the data. Period. If we decide to reject the model on seeing that outcome then we can expect to be wrong in the fraction p of cases where the model is true.
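As a rough illustration of that frequency reading, here is a minimal simulation sketch in Python (the normal null model, the sample size and the particular tail event are all just assumptions chosen for the example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Null model (assumed for illustration): the sample mean of n standard-normal
# observations.  Pre-specified target event: the sample mean exceeds a
# threshold chosen so that the event has probability p under the null.
n, p = 25, 0.05
threshold = norm.ppf(1 - p) / np.sqrt(n)   # P(mean > threshold | null) = p

trials = 100_000
sample_means = rng.standard_normal((trials, n)).mean(axis=1)
reject = sample_means > threshold

# Since every simulated data set really does come from the null model,
# every rejection here is an error; the error frequency should be close to p.
print(f"rejection (error) rate when the null is true: {reject.mean():.4f}  (nominal p = {p})")
```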
Of course if we just choose a low probability event at random we probably won’t see it and so will have nothing to conclude, so it is important to pick as our test event something that we suspect will happen more frequently than the model predicts. (This doesn’t require that we necessarily have any specific alternative model in mind, but if we do then there may be more powerful methods of analysis which allow us to determine the relative likelihoods of the various models.)
Note: None of this tells us anything about the “probability that the model is true” or the “probability that our rejection is wrong” after the fact. (No one but an extremely deluded appropriator of the label “Bayesian” would attempt to assign a probability to something that has already happened or which actually is either true or false.)
To repeat: the p-value is the frequency with which you can expect to be wrong (if you reject at level p) in cases where the null hypothesis is true. This is higher than the frequency with which you will actually be wrong among all the times you apply that rule, because the null hypothesis may not actually be true, and none of those cases will count against you (since failure to reject something is not the same as actually accepting it, and it is never factually wrong to withhold judgement – though I suppose it may often be morally wrong!).
P.S. Significance should not be confused with importance! Anyone who speaks English correctly should understand that significance refers to strength of signification – i.e. to the relative certainty of a conclusion – not to the importance of that conclusion. So it is possible to have a highly significant indication of a very small or unimportant effect, and estimating the “size” of whatever effect is confounding the null hypothesis cannot be done with the p-value alone.
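To make that distinction concrete, here is a small sketch (a two-sided z-test on a normal mean with known variance; the effect size and sample sizes are made up for the illustration) showing how a fixed, practically trivial effect becomes ever more “significant” as the sample grows:

```python
import numpy as np
from scipy.stats import norm

# Assumed for illustration: a true mean shift of 0.02 standard deviations --
# tiny by almost any practical standard -- tested against a null mean of 0.
effect, sigma = 0.02, 1.0

for n in (100, 10_000, 1_000_000):
    se = sigma / np.sqrt(n)
    z = effect / se                      # expected z-statistic at this sample size
    p = 2 * norm.sf(z)                   # two-sided p-value for that z
    print(f"n = {n:>9}: z = {z:6.2f}, p = {p:.2e}")
# The effect size never changes, but the p-value becomes "highly significant"
# as n grows: significance measures strength of signal, not size of effect.
```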
P.P.S. There is of course a significant – by which I just mean quite large and certainly non-zero – probability that a randomly chosen pronouncement from my repertoire may be flawed. So if you find something to object to here you could get lucky.
UPDATE 7:45pm Oct 22: See the end of my long comment below re the common practice of computing a “p-value” after the fact from an actual observation rather than from a target event specified in advance.
I think major confusion occurs because the p value is defined in terms of the null being true, but people choose horrible null models that are regarded as false by everyone who gives them any thought (a regression coefficient of exactly zero, two means exactly equal)! So your definition is irrelevant to the common use case.
The reason they do this is because they can’t stand to just describe the data until someone comes up with a decent model to test and build upon. Everyone needs to “make decisions” or “draw conclusions” before gathering enough evidence to say anything specific. This is a major departure from strategies that have been successful for science and engineering in the past.
Ask, e.g., a medical researcher these days to come up with a theory that can predict a parameter or the form of a mathematical model based on the processes they expect are generating the data, and they look at you confused. People are dying because of this confusion.
I think you are absolutely right!
The problem is not with p-values per se though, but rather with the choice of null models, and it would be possible to eliminate the “significant” evidence for trivial effects by defining the null hypothesis in terms of a bound on the correlation or difference of means.
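For instance, here is a minimal sketch of what such a bounded null might look like (a one-sample normal mean with known standard error; the bound delta and all the numbers are assumptions invented for the illustration):

```python
import numpy as np
from scipy.stats import norm

def p_point_null(xbar, se):
    """Usual two-sided p-value against the point null mu = 0."""
    return 2 * norm.sf(abs(xbar) / se)

def p_interval_null(xbar, se, delta):
    """Two-sided p-value against the interval null |mu| <= delta.

    P(|Xbar| >= |xbar| ; mu) is maximised over the null at mu = +/- delta,
    so that boundary value serves as the p-value for the composite hypothesis.
    """
    c = abs(xbar)
    return norm.sf((c - delta) / se) + norm.cdf((-c - delta) / se)

# Assumed numbers: a tiny observed effect measured very precisely;
# delta is the smallest difference anyone would actually care about.
xbar, se, delta = 0.03, 0.01, 0.1
print(f"point null   : p = {p_point_null(xbar, se):.4f}")            # 'significant'
print(f"interval null: p = {p_interval_null(xbar, se, delta):.4f}")  # nowhere near significant
```

With the point null the trivially small effect looks “significant”, but against the bounded null the same observation carries essentially no evidence at all.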
That problem is not with p values per se, true. There are more problems, though. Here is one doozy: there is no correct way to define a multiple comparison. I will just quote Jerome Cornfield on this one:
“How multiple should the multiple comparison be? Do we want error control over a single trial, over all the independent trials on the same agent, on the same disease, over the lifetime experience of a single investigator, etc.? Does a hypothesis rejected at the .05 level for a single comparison have the same amount of evidence against it as an omnibus hypothesis involving multiple comparisons but also rejected at the .05 level? The unanswerability of these questions suggests that they have been incorrectly posed and that the multiple comparison problem is the symptom and not the disease—the disease being the inappropriateness of the p value as a measure of uncertainty.”
Cornfield, Jerome (1976). “Recent Methodological Contributions to Clinical Trials”. American Journal of Epidemiology 104 (4): 408–421.
http://www.epidemiology.ch/history/PDF%20bg/Cornfield%20J%201976%20recent%20methodological%20contributions.pdf
Again I agree in part. The p-value is NOT a “measure of uncertainty”. In the context I have been discussing it is just what the probability of an observed (previously specified) target event would be if the null hypothesis were true. And so, if you always reject a hypothesis whenever you observe a previously specified event to which that hypothesis assigns probability p, then you can expect to be wrong with probability p if the hypothesis is true (and so with probability less than p in general).
I suspect Cornfield is right that there are better approaches to dealing with multiple relationships and ongoing data collection, but I would quarrel with his claim that with p-values “if you perform enough significance tests you are sure to find significance, even when none exists”. What you will find after doing a lot of experiments is “significant” evidence for something that is NOT true (which you might, with probability p, find even if you do only one test). And there’s nothing wrong with that, so long as one does not confuse the word “significant” with “conclusive”. The p-value then just gives you a bound on how often what you find significant will actually be misleading.
The problem with using p-values for testing multiple relationships in the same experiment is not that it becomes “highly probable that one or more p values less than the conventional .05 or .01 will be obtained”, but rather that what are called “p values” in such a case are NOT actually p-values, because the specific extreme observation ranges Oi > bi that are being considered “significant” were not specified in advance. What was specified in advance in such cases is the disjunction over all of the measured variables (i.e. O1 > b1 OR O2 > b2 OR … etc.) – which has a much higher probability, and so a higher p-value.
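To get a feel for how much higher, here is a quick simulation sketch (assuming, purely for simplicity, independent standardized variables with a common cutoff whose individual null probability is 0.05):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# k measured variables under the null (independence is an assumption made
# just for this sketch), each with its own cutoff b_i chosen so that
# P(O_i > b_i) = 0.05 individually.
k, p, trials = 20, 0.05, 100_000
b = norm.ppf(1 - p)                         # common cutoff for each standardized O_i
O = rng.standard_normal((trials, k))

any_exceeds = (O > b).any(axis=1).mean()    # frequency of the pre-specified disjunction
print(f"null probability of each individual event       : {p}")
print(f"null probability of 'O1>b1 OR ... OR O{k}>b{k}'  : {any_exceeds:.3f}   "
      f"(theory: {1 - (1 - p) ** k:.3f})")
```

The individually reported “p = 0.05” events add up, under the null, to a disjunction with probability around 0.64 – and that disjunction is the only event that was actually specified in advance.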
There are perfectly good techniques for “permitting any amount of multiple testing” (even with p-values) so long as the amount of multiplicity is specified in advance. But drawing inferences and making decisions on the basis of data subsets chosen after the fact is not a use of p-values at all. I would not attempt to define a p-value for an event not specified before it was looked for (and by “specified” I don’t mean just “mentioned” but identified as the complete target of investigation). In such cases likelihood ratios and Bayesian methods may well be more useful, but the p-value still just is what it is and it’s the abuse rather than the concept itself which is problematic.
I should probably add something in the OP above about the common practice of computing the p-value in terms of a target set defined by the observed value rather than determining the target set in advance from a desired upper bound on the probability of error. I suspect that Fisher understood the difficulty of interpreting such a number and that this (rather than the stuff about copyrights) is the real reason he went for pre-specification of significance at a few arbitrary levels rather than imagining that one might enter a study with a special interest in having less than one in 25 (p = .04) erroneous rejections rather than one in 20 (p = .05).
I suppose the post-facto p-value computed in terms of the observation would have to be interpreted as something like “the lowest pre-set probability of erroneous rejection whose adoption would have led me to reject the null hypothesis on the basis of this observation (with the conventional inequality class of target events)”. I suspect that might even make sense, but it’s far too convoluted for a non-statistician like me to understand (and I haven’t thought at all about how it might extend to the case of multiple variables).
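For what it’s worth, here is a tiny numerical sketch of that reading (a one-sided z-test with made-up numbers; everything in it is assumed for illustration): the post-facto “p-value” comes out as the smallest pre-set rejection level at which this particular observation would have triggered rejection.

```python
import numpy as np
from scipy.stats import norm

z_obs = 2.17                    # assumed observed statistic (one-sided z-test)
post_facto_p = norm.sf(z_obs)   # the usual 'observed p-value'

# Scan pre-set levels alpha: the rejection region is z > norm.ppf(1 - alpha).
alphas = np.linspace(0.001, 0.2, 2000)
would_reject = z_obs > norm.ppf(1 - alphas)
smallest_rejecting_alpha = alphas[would_reject].min()

print(f"post-facto p-value                 : {post_facto_p:.4f}")
print(f"smallest pre-set level that rejects: {smallest_rejecting_alpha:.4f}")
# The two agree (up to the grid resolution): the observed p-value is the
# smallest error rate one could have committed to in advance and still
# ended up rejecting on this data.
```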
I think it may help to use an example closer to the application Cornfield is thinking of.
Let’s say I run an experiment testing drug A. The results do not meet my criteria for publication; the p value is .67 or whatever – “nonsignificant”.
Then I run an unrelated experiment testing drug B. My p value is .03.
These are two unrelated studies both with the endpoints properly specified beforehand.
According to your interpretation, why should I not “correct” that second p value to .06? The reason I calculated that p value is to make a decision about whether more research should be done on drug B.
The researcher’s logic: I know that if I test drugs A–Z it is likely that one will give me a spurious result and waste my time/money, therefore clearly if I test two drugs it would be wise to be more skeptical of any result showing one of them “works”. I cannot know how many drugs I will test beforehand as this depends on outside factors. So it is impossible to prespecify everything I will investigate and calculate an “overall p value” (whatever that is). Still, just interpreting the p value as is seems unreliable, but then again the two drugs were totally unrelated to each other.
The precise role of pre-specification in the logic seems nebulous to me.
OK, I am not sure what you mean by “why should I not ‘correct’ that second p value to .06?”
But let me try this.
If you test 26 random useless drugs A–Z with a significance cutoff at p, then your probability of seeing something that looks significant in each case is p, so the expected number of “successes” is 26p (and the probability of seeing at least one success is 1 - (1-p)^26, which is pretty close to 26p if p is very small and close to 1 if it is not). If I randomly choose one drug, B, and agree to invest in it if it gave you a significant result, then my chance of being misled is just p (basically because my random choice is probably NOT one of your successes, so I probably won’t invest – remember, at this point we are assuming all investments are bad because all of the drugs A–Z are actually useless). But if I had agreed to invest on the basis of ANY significant result then I am almost certain to be wasting my money (and absolutely certain if your experiments don’t all fail).
Of course if you are any use as a researcher then you won’t pick all duds, and the sign of your usefulness will be the extent to which your number of “significant” results exceeds the expected number, or in other words, when we go to arbitrary numbers of trials, the extent to which the frequency with which you see significance at cutoff p actually exceeds p. If it doesn’t do so by much, then you are basically just guessing (or very unlucky); and if your frequency is less than p … well that shouldn’t really happen, even if you are only guessing, unless you really are very unlucky (or someone is interfering with your results).
If you really do just one study, and it produces a significant p-value, then your success rate of 100% certainly beats p and so it is worth taking notice. But if you continue and do more studies, then it is important for me to know the number of failures as well as successes, and if your success rate exceeds p then we can start to think that some of the excess successes are due to real effects (but we won’t know which ones so you won’t be much use as a researcher unless the excess is substantial).
What’s really NOT any use is just reporting successes with their associated p-values (the researchers themselves may know how many tries it took, but no one else does). So a discipline or journal that fails to require prior registration of ALL statistical studies before they are performed, as a condition of considering them for publication, is really, in my opinion, not being responsible.
“If you really do just one study, and it produces a significant p-value, then your success rate of 100% certainly beats p and so it is worth taking notice. But if you continue and do more studies, then it is important for me to know the number of failures as well as successes, and if your success rate exceeds p then we can start to think that some of the excess successes are due to real effects (but we won’t know which ones so you won’t be much use as a researcher unless the excess is substantial).”
I will focus on this. So what if it is not just me doing studies and not reporting them when the results are “not significant”, but also another investigator who is replicating all of my own work concurrently? Now I also need to adjust my interpretation of a p value for these studies?
It sounds like that is what you are saying, and I would agree. This is what I do informally when information is available. But this line of logic leads us to the absurd conclusion that we should correct for every p value ever calculated. There is no “dividing line” that has been established, and Cornfield is claiming it is impossible to find one.
Thanks for engaging in this discussion; I am finding it really helpful to clarify and extend my understanding of these issues. But before we go on I should repeat the disclaimer that I am not a statistician, and so any of my speculations that are not wrong may be already dealt with and better presented elsewhere.
It seems to me that the “dividing line” can be established by thinking of the “advantage” (the actual relative frequency of “significant” results minus the cutoff p-value) as a property of a source (e.g. an individual researcher, or institute, or journal). And for that, the relevant studies are just all of those ever performed or pre-authorized by that particular source.
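Here is a rough simulation of that “advantage” (the fraction of real effects, the effect size and the one-sided z-test are all assumptions invented for the sketch):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def significant_fraction(n_studies, frac_real, effect, n=25, p=0.05):
    """Fraction of studies hitting significance at cutoff p.

    Each study is a one-sided z-test of a sample mean of n observations;
    a fraction frac_real of the studied effects are genuinely nonzero
    (all of these numbers are assumptions for this sketch).
    """
    real = rng.random(n_studies) < frac_real
    means = np.where(real, effect, 0.0)
    xbar = rng.normal(means, 1 / np.sqrt(n))
    return (xbar > norm.ppf(1 - p) / np.sqrt(n)).mean()

p = 0.05
for frac_real in (0.0, 0.2, 0.5):
    freq = significant_fraction(50_000, frac_real, effect=0.8, p=p)
    print(f"fraction of real effects {frac_real:.1f}: "
          f"significance rate {freq:.3f}, 'advantage' over p = {freq - p:+.3f}")
# A source studying only duds (frac_real = 0) shows essentially no advantage;
# the excess over p grows with the fraction of genuine effects being studied.
```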
It struck me after making my earlier comment that one problem with using this “advantage” as indicative of the quality of the source is that it would be easy to “game the system” by generating a high frequency of “significant” results just by doing lots of “studies” which merely confirm known effects. So perhaps replication studies should not be part of the mix, and the results obtained should somehow be weighted by their degree of unexpectedness. I would not attempt to quantify this myself (though a Bayesian might!), so really I guess it’s just an informal notion as you suggest.
P.S. With regard to the duplication issue, it seems to me that that’s more like just increasing the size of the original study. If you and I both try the same experiment with N subjects or trials for each of us, then couldn’t that just be considered as a single run of 2N trials? [And if the study type leads to a formula for the p-value in terms of results and trials as p(r, n), then in order to combine p-values without having the actual data, I guess we’d need a function with the property f(p(r1, n1), p(r2, n2)) < p(r1+r2, n1+n2).]
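As a concrete (if imperfect) illustration of why such a combining function is not trivial, here is a sketch comparing the p-value of the pooled run with what Fisher’s classical method for combining the two separate p-values gives. Fisher’s method is just a standard combiner I have substituted in for the hypothetical f – it is not proposed anywhere above, and it does not in general reproduce the pooled value; the binomial set-up and all the numbers are assumed for the example.

```python
from scipy.stats import binomtest, combine_pvalues

# Assumed example: two runs of the same binomial experiment, null success
# probability 0.5, testing for an excess of successes (one-sided).
r1, n1 = 32, 50
r2, n2 = 33, 50

p1 = binomtest(r1, n1, 0.5, alternative="greater").pvalue
p2 = binomtest(r2, n2, 0.5, alternative="greater").pvalue

# Pooled analysis, as if it had been a single run of n1 + n2 trials.
p_pooled = binomtest(r1 + r2, n1 + n2, 0.5, alternative="greater").pvalue

# Fisher's method: a standard way to combine p-values without the raw data
# (substituted here for the hypothetical f; it is not the same as pooling).
_, p_fisher = combine_pvalues([p1, p2], method="fisher")

print(f"p1 = {p1:.4f}, p2 = {p2:.4f}")
print(f"pooled single-run p-value: {p_pooled:.5f}")
print(f"Fisher-combined p-value  : {p_fisher:.5f}")
```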
I am no statistician either, just a researcher who came across this post linked from somewhere else and decided to “engage”, because why not?
The truth is I have come to doubt the usefulness of significance testing and more so hypothesis testing altogether for scientific research. It may be useful in theory but in practice there is too much that goes on for very practical reasons that invalidates prespecified experimental designs. There are also huge pieces of the logical basis missing (the multiple comparison issue we are discussing is one example) that seem irrelevant when using simple examples but can make a huge difference in practice.
I will get back to you with a direct response to your last post as I want to dig up some references.
Sorry, got busy. Here is an interesting perspective though. It reduced my negative impression of p values.
http://arxiv.org/pdf/1311.0081v1.pdf
Thanks Fr. That paper looks interesting. I have never taken the time to think enough about likelihood theory to justify making any comment on that part, but I do think Lew overstates the case a bit in his defense of “Fisher’s disjunction” which he puts in the form of a syllogism as follows:
Extreme P-values from random samples are rare under the null hypothesis.
An extreme P-value has been observed.
(Therefore, either a rare event has occurred or the null hypothesis is false.)
Therefore, the null hypothesis is probably false.
He then says “There is nothing wrong with that, although the line in parentheses is not logically necessary”, but I would argue rather that the line in parentheses is the valid conclusion and the last line does NOT logically follow (but is not required in order for the P-value to be useful).
I would also take issue with the earlier claim that “The idea that P-values measure type I error rates is as pervasive as it is erroneous, …” This may be fair enough if we think of a type I error as a “false positive” about some specific alternative hypothesis, but I think not if it is just identified as false rejection of the null hypothesis (at least if one interprets “measure” as “bound”). It is certainly not the type I error rate for a comparison of hypotheses, but it seems to me that it is indeed the type I error rate for the case where the alternative hypothesis is just the negation of the null.