To follow up on my previous post: since the occurrence of an event that has low probability according to some model is not always a reason to doubt the model, it is natural to ask what would be. And for this, some of the negative examples from last time may help to bring things into focus.
In the case of a continuously distributed variable (one with an absolutely continuous distribution, so that it has a density) the probability of any particular value is zero, so whatever we observe is a case of something that had low probability in the model. (And much the same applies to a discrete variable with a large number of equally likely values.)
When we throw an icosahedral die, whatever face we see on top has a probability of only 1/20 of being there, but we don’t take that as evidence that the die is not a fair one. However, if someone specific had correctly predicted the result then we might be more suspicious – and that is the key to how p-values work. (By “someone specific” here I mean specified in advance – not the same as having 20 people place bets and being surprised to see one of them get it right.)
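To make the parenthetical concrete, here is a quick sketch (assuming a fair 20-sided die and independent bets; the numbers are the point, not the code) contrasting the two situations:

```python
# One pre-specified prediction vs. twenty independent bettors on a fair
# icosahedral (20-sided) die.
p_single = 1 / 20                   # a single nominated predictor calls the face: 0.05
p_any_of_20 = 1 - (19 / 20) ** 20   # at least one of 20 independent bettors calls it

print(f"pre-specified predictor correct: {p_single:.3f}")     # 0.050
print(f"some one of 20 bettors correct:  {p_any_of_20:.3f}")  # about 0.642
```

So a lone pre-specified prediction coming true really is a 1-in-20 event, while seeing one winner among twenty bettors is more likely than not.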
Similarly, in my silly example from last time, although a value very close to the predicted mean should not cause us to doubt that the predicted mean is correct, it may well cause us to doubt that the variance is as large as the model proposes. (And in fact there are several historical examples where data clustered too close to the predicted mean has been taken as convincing evidence of experimental malfeasance.)
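For a "too close to the mean" test the relevant p-value is just the probability the model gives to that closeness. A minimal sketch, with a made-up model X ~ Normal(mu, sigma^2) and a made-up tolerance eps standing in for the silly example's numbers:

```python
# p-value for "suspiciously close to the predicted mean", i.e. the event
# |X - mu| < eps, under an assumed model X ~ Normal(mu, sigma^2).
# mu, sigma and eps are illustrative numbers, not taken from the original example.
from scipy.stats import norm

mu, sigma, eps = 100.0, 15.0, 0.1
p_too_close = norm.cdf(mu + eps, mu, sigma) - norm.cdf(mu - eps, mu, sigma)
print(f"P(|X - mu| < {eps}) = {p_too_close:.4f}")  # about 0.0053
```

If that event had been chosen as the test in advance, observing it would cast doubt on the proposed variance rather than on the proposed mean.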
So in order to be made suspicious by a low p-value it seems to be important that we know in advance what statistic we will be interested in and what kind of results we will consider significant.
This does not answer the question of exactly what that significance means or how we quantify it, but I think it does suggest that there is a valid intuition behind the idea that seeing something actually occur right after asking whether it will happen makes us doubt the claim that it actually had very low probability.
Now when I buy a lottery ticket I do wonder if I will actually win. So if I do win the jackpot I will be faced with something that to me would be a significant indication that there was more than chance involved. Of course in that case I will probably be wrong, but the probability of my being forced into that error is so small as to be of little worry to me.
Similarly, if I reject a null hypothesis model on seeing a pre-specified outcome to which the model assigns a probability of p (whether it’s an extreme value of some statistic or just a value very close to some specific target), then if the hypothesis is actually true I have a probability p of being wrong.
That’s what the p-value really is. It’s the probability that the model predicts for whatever outcome we choose to specify in advance of checking the data. Period. If we decide to reject the model on seeing that outcome then we can expect to be wrong in the fraction p of cases where the model is true.
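That claim is easy to check by simulation. A sketch under assumed stand-ins: the null model is Normal(0, 1) and the pre-specified outcome is X > 1.645, to which the model assigns a probability of about 0.05:

```python
# Simulation: reject the (true) null whenever a pre-specified event of
# probability p occurs, and check how often that rejection is wrong.
import random

P = 0.05
THRESHOLD = 1.645  # Normal(0,1) upper 5% point, so P(X > THRESHOLD) is about P
trials = 100_000
wrong_rejections = 0

for _ in range(trials):
    x = random.gauss(0.0, 1.0)   # data generated with the null actually true
    if x > THRESHOLD:            # the pre-specified event occurs, so we reject
        wrong_rejections += 1    # ...and, the null being true, we are wrong

print(f"fraction of wrong rejections: {wrong_rejections / trials:.3f}")  # ~0.05
```

Since the data are generated with the null actually true, every rejection here is a wrong one, and they occur in roughly the fraction p of trials.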
Of course if we just choose a low probability event at random we probably won’t see it and so will have nothing to conclude, so it is important to pick as our test event something that we suspect will happen more frequently than the model predicts. (This doesn’t require that we necessarily have any specific alternative model in mind, but if we do then there may be more powerful methods of analysis which allow us to determine the relative likelihoods of the various models.)
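The parenthetical about relative likelihoods can also be made concrete. A rough sketch, assuming (purely for illustration) a Normal(0, 1) null and a Normal(1, 1) alternative:

```python
# If a specific alternative model is in mind, the relative likelihoods of the
# two models for the observed data can be compared directly.  The two models
# here are illustrative only.
from scipy.stats import norm

x_observed = 1.8
L_null = norm.pdf(x_observed, loc=0, scale=1)   # likelihood under the null model
L_alt  = norm.pdf(x_observed, loc=1, scale=1)   # likelihood under the alternative
print(f"likelihood ratio (alt/null) = {L_alt / L_null:.2f}")  # well above 1 here
```

A ratio well above 1 says the observed value is better explained by the alternative, which is a different (and often more powerful) kind of statement than a p-value.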
Note: None of this tells us anything about the “probability that the model is true” or the “probability that our rejection is wrong” after the fact. (No one but an extremely deluded appropriator of the label “Bayesian” would attempt to assign a probability to something that had already happened or which actually is either true or false.)
To repeat: the p-value is the frequency with which you can expect to be wrong (if you reject at level p) in cases where the null hypothesis is true. This is higher than the frequency with which you will actually be wrong among all the times you apply that rule, because in some of those applications the null hypothesis will not actually be true, and none of those cases can count against you (since failure to reject something is not the same as actually accepting it, and it is never factually wrong to withhold judgement – though I suppose it may often be morally wrong!).
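A simulation makes the distinction visible. Here the fraction of tested problems with a true null, and the size of the real effect in the others, are my own arbitrary choices:

```python
# Among all uses of the "reject at level p" rule, the frequency of factually
# wrong rejections is only p times the fraction of cases where the null is true.
import random

THRESHOLD = 1.645          # reject when X > 1.645 (p about 0.05 under the null)
trials = 100_000
null_true_fraction = 0.5   # assume half the problems we test really have a true null
wrong = 0                  # rejections made while the null was in fact true

for _ in range(trials):
    null_is_true = random.random() < null_true_fraction
    x = random.gauss(0.0 if null_is_true else 2.0, 1.0)
    if x > THRESHOLD and null_is_true:
        wrong += 1

print(f"wrong rejections among ALL applications: {wrong / trials:.3f}")  # ~0.025, not 0.05
```

The rule rejects wrongly in about p of the true-null cases, so the frequency of factually wrong rejections over all applications is only p times the (unknown) fraction of cases where the null holds.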
P.S. Significance should not be confused with importance! Anyone who speaks English correctly should understand that significance refers to strength of signification – i.e. to the relative certainty of a conclusion – not to the importance of that conclusion. So it is possible to have a highly significant indication of a very small or unimportant effect, and estimating the “size” of whatever effect is confounding the null hypothesis is something that cannot be done with the p-value alone.
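For example (with an illustrative sample size and effect size of my own choosing), a tiny shift of the mean becomes overwhelmingly “significant” once the sample is large enough:

```python
# A highly "significant" result for a tiny, probably unimportant effect:
# with a large enough sample even a 0.01 shift in the mean gives a minute p-value.
from math import sqrt
from scipy.stats import norm

n, effect, sigma = 1_000_000, 0.01, 1.0
z = effect / (sigma / sqrt(n))            # z = 10 for these numbers
p_value = norm.sf(z)                      # one-sided p-value
print(f"z = {z:.1f}, p-value = {p_value:.2e}")  # p is around 1e-23
```

The minute p-value says the 0.01 shift is almost certainly real; it says nothing about whether a shift of 0.01 matters.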
P.P.S. There is of course a significant – which is to say quite large and certainly non-zero – probability that a randomly chosen pronouncement from my repertoire may be flawed. So if you find something to object to here you could get lucky.
UPDATE 7:45pm Oct 22: See the end of my long comment below re the common practice of computing a “p-value” after the fact from an actual observation rather than from a target event specified in advance.