## What’s Wrong With P-Values?

One of my favourite bêtes noires claims to have put everything wrong with P-Values under one roof.

My response started with  “There’s nothing wrong with p-values any more than with Popeye. They is what they is and that’s that. To blame them for their own abuse is just a pale version of blaming any other victim.”

Briggs replied saying “This odd because there are several proofs showing there just are many things wrong with them. Particularly that their use is always fallacious.” which is odd itself as it seems to be just a reworking of exactly what I said, namely that what is “wrong” with them is just the (allegedly) fallacious uses that are made of them.

My comment continued with the following example:

“But if you are the kind of pervert who really enjoys abuse here goes:
Let H0 be the claim that z ~ N(0,1) and let r = 1/z.
Then P(|r|>20) = P(|z|<.05) ≈ .04 < .05
So if z is within .05 of 0 then the p-value for r is less than .05 and so at the 95% confidence level we must reject the hypothesis that mean(z)=0.”
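The arithmetic in that example is easy to check numerically. The following is a minimal sketch using only the Python standard library; the helper name `std_normal_cdf` is my own choice for illustration and does not come from Briggs' post.

```python
import math

def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal N(0,1), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Under H0, z ~ N(0,1) and r = 1/z, so |r| > 20 exactly when |z| < 0.05.
p_small_z = std_normal_cdf(0.05) - std_normal_cdf(-0.05)
print(f"P(|z| < 0.05) = P(|r| > 20) = {p_small_z:.4f}")
```

The printed value is close to .04, in line with the approximation quoted above.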

Now the joke here is really based on Briggs’ mis-statement of what a p-value is. Not that there would be anything wrong with the thing he defined but it just wouldn’t be properly called a p-value. And in order to criticize something (or even just the use of that thing) you need to know what it actually is. So for the enlightenment of Mr Briggs, let me explore what a p-value actually is.

What Briggs defined as a p-value is as follows: “Given the model used and the test statistic dependent on that model and given the data seen and assuming the null hypothesis (tied to a parameter) is true, the p-value is the probability of seeing a test statistic larger (in absolute value) than the one actually seen if the experiment which generated the data were run an indefinite number of future times and where the milieu of the experiment is precisely the same except where it is “randomly” different.” This has a number of oddities (excessive and redundant uses of the word “given” and the inclusion of an inappropriate repetition condition being among them) but the most significant thing wrong with it is that it only applies to certain kinds of test statistic – as demonstrated by my silly example above.

A better definition might be: Given a stochastic model (which we call the null hypothesis) and a test statistic defined in terms of that model, the p-value of an observed value of that statistic is the probability in the model of having a value of the statistic which is further from the predicted mean than the observed value.
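To make that definition concrete, here is a minimal sketch which assumes the null model is a normal distribution with known mean and standard deviation. That choice is only for illustration (the definition is not restricted to it), and the function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(observed: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """P(|X - mean| > |observed - mean|) for X ~ N(mean, sd^2).

    This is the probability, in the model, of a value further from the
    predicted mean than the one observed.
    """
    deviation = abs(observed - mean) / sd
    # erfc(d / sqrt(2)) equals 2 * (1 - Phi(d)) for the standard normal CDF Phi.
    return math.erfc(deviation / math.sqrt(2.0))

p_at_cutoff = two_sided_p_value(1.96)   # close to the conventional 0.05
p_at_mean = two_sided_p_value(0.0)      # an observation exactly at the mean
print(p_at_cutoff, p_at_mean)
```

An observation exactly at the predicted mean gets a p-value of 1, and the p-value shrinks as the observation moves away from the mean in either direction.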

With this definition, it becomes clear that if the null hypothesis is true (ie if the model does accurately predict probabilities) then the occurrence of a low P-value implies the occurrence of an improbable event and so the logical disjunction that Briggs quotes from R A Fisher, namely “Either the null hypothesis is false, or the p-value has attained by chance an exceptionally low value” is indeed correct.

Briggs’ claim that this is “not a logical disjunction” is of course nonsense (any statement of the form “Either A or B” is a logical disjunction), and this one has the added virtue of being true. Of course if the observed statistic has a low p-value then the disjunction is essentially tautological, but then really so is anything else that we can be convinced of by logic.

But Briggs is right to wonder if it has any significance – or at least, if it does then what is the reason for that.

Why do some people consider the occurrence of a low p-value to be significant (in the common language sense rather than just by definition)? In other words, why and how should it reduce our faith in the null hypothesis?

The first thing to note is that the disjunction  “Either the null hypothesis is false, or something very improbable has happened” should NOT actually do anything to reduce our faith in the null hypothesis. It certainly matters what kind of improbable thing we have seen happen.  For example a meteor strike destroying New York should not cause us to doubt the hypothesis that sex and gender are not correlated – so clearly the improbable observed thing must be something that is predicted to be improbable by the null hypothesis model.  But in fact, in any model with continuously distributed variables the occurrence of ANY particular exact observed value is an event of zero probability. One might hope to talk in such cases of the probability density instead, but the probability density can be changed just by re-scaling the variable, so that won’t do either.
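The point about re-scaling can be illustrated with a small sketch. It assumes a normal model and a change of units by a factor of 1000 (say metres to millimetres); both are arbitrary choices made only for the example.

```python
import math

def normal_pdf(x: float, sd: float = 1.0) -> float:
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

scale = 1000.0   # assumed change of units
x_obs = 0.5      # an arbitrary observed value in the original units

# The same observation, described in the original and in the re-scaled units.
density_original = normal_pdf(x_obs, sd=1.0)
density_rescaled = normal_pdf(scale * x_obs, sd=scale)

print(density_original, density_rescaled)
```

The observed event is identical in both descriptions, yet the density attached to it differs by exactly the scale factor, so a "low density" cannot by itself be what counts as surprising.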

What is it about the special case of a low p-value, ie an improbably large deviation from the expected value of a variable, that reduces our faith in the null hypothesis?

…to be continued

### 12 Responses to “What’s Wrong With P-Values?”

1. Nullius in Verba says:

“so clearly the improbable observed thing must be something that is predicted to be improbable by the null hypothesis model.”

You need both a null hypothesis and an alternative hypothesis, and the observation must have *different* predicted probabilities under the two hypotheses. It’s the ratio of the probabilities that constitutes evidence (in favour of one hypothesis over the other hypothesis).

It’s not enough that the observation has a *low* probability under the null hypothesis, it must also have a *high* probability under the *alternative* hypothesis.

But see my ‘7 October 2013 at 3:27 pm’ comment at Briggs’ place for the maths.

2. alan says:

Yes Nullius, I agree that just seeing “something that is predicted to be improbable by the null hypothesis model” is not sufficient to reject the hypothesis (the example of any exact value in a continuum being a case in point).

For now (despite the intent of your name) I’ll take your word on how to use p-values to assess relative likelihoods and that that requires two hypotheses (or perhaps a continuum of them labelled by a parameter in the case of maximum likelihood estimation).

But often low p-values seem to be used to “reject” a “null hypothesis” even when there is no explicit alternative provided and I was looking for an explanation of how that might make sense.

In my following post I try to provide one – namely that if our rule is to pick *any* specific event to which H0 assigns low probability and reject H0 only if that specific event occurs then our rejection will only be wrong with that low probability. This sounds kind of silly, and it would be if we picked our test event at random (because then we’d just have a very low probability of drawing any conclusion). But we might be able to choose a more productive type of test event if we have a strong suspicion of the way in which H0 might be wrong (but without having a specific alternative in mind).
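As a rough illustration of that rule, the sketch below compares two test events that each have probability of about .05 under H0: z ~ N(0,1). One is the usual pair of tails and the other is a narrow band around zero. The alternative with true mean 3 is purely an assumption of mine, standing in for "a strong suspicion of the way in which H0 might be wrong".

```python
import math

def normal_cdf(x: float, mean: float = 0.0) -> float:
    """CDF of N(mean, 1) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def prob_tails(mean: float, cutoff: float = 1.96) -> float:
    """P(|z| > cutoff) when z ~ N(mean, 1)."""
    return normal_cdf(-cutoff, mean) + (1.0 - normal_cdf(cutoff, mean))

def prob_centre(mean: float, half_width: float = 0.0627) -> float:
    """P(|z| < half_width) when z ~ N(mean, 1)."""
    return normal_cdf(half_width, mean) - normal_cdf(-half_width, mean)

# Under H0 (mean 0) both events have probability of about 0.05, so either
# rejection rule is wrong about 5% of the time when H0 is true.
false_reject_tails = prob_tails(0.0)
false_reject_centre = prob_centre(0.0)

# Hypothetical way for H0 to be wrong: the true mean is 3 (an assumption).
detect_tails = prob_tails(3.0)
detect_centre = prob_centre(3.0)

print(false_reject_tails, false_reject_centre)
print(detect_tails, detect_centre)
```

Both rules wrongly reject a true H0 equally rarely, but only the tail event is likely to occur when the mean has shifted, which is the sense in which one choice of test event is more productive than another.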

3. Nullius in Verba says:

The reason for using two hypotheses is that there are certain quantities that are very difficult to estimate, and using two allows us to eliminate them from the equations.

Bayes’ theorem says:
P(H|Obs) = P(Obs|H)P(H)/P(Obs)

we write this twice, once for each hypothesis, and then divide:
P(H0|Obs)/P(H1|Obs) = (P(Obs|H0)/P(Obs|H1)) * (P(H0)/P(H1))

Then take logs. The rest is just a matter of how you interpret it.

There’s always an alternative hypothesis, even if it’s just that H0 is false, with probability P(H1) = 1 – P(H0).

I agree that it is common to see people “rejecting” a null hypothesis on the basis of a low p-value, but you can’t conclude from that that they are right to do so. Often they are, because the other required conditions happen to be true, and the author has usually checked them implicitly by making sure it makes intuitive sense. (For example, in your meteor strike example you realised that the conclusion didn’t follow, even though the p-value was very low. Your intuition understands, even if you can’t put it into words. The asteroid strike tells you nothing because it is *equally* improbable whether sex and gender are correlated or not.)

However, Briggs is correct to say that taken literally, the p-value procedure in the absence of the other conditions makes no sense. He goes crazy over the top in saying that this means they are always and entirely useless, but the less extreme point is well-known in statistics. People, even scientists, commonly misunderstand what p-values actually mean.

If we only required that we should wrongly reject with a very low probability, the easiest procedure is to never reject. Then the probability of a wrong rejection is zero. We can do better than that, though. The Bayesian formulation both gives us a more reliable procedure, and explains both when and why the simpler p-value procedure will work.
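The odds form of Bayes' theorem used in this comment can be sketched numerically as follows. All of the likelihoods and priors below are made-up numbers for illustration, and the function name `log_posterior_odds` is hypothetical; the only part taken from the comment is the formula itself, with P(H1) = 1 – P(H0).

```python
import math

def log_posterior_odds(lik_h0: float, lik_h1: float, prior_h0: float) -> float:
    """log[P(H0|Obs)/P(H1|Obs)] = log[P(Obs|H0)/P(Obs|H1)] + log[P(H0)/P(H1)]."""
    prior_h1 = 1.0 - prior_h0
    log_likelihood_ratio = math.log(lik_h0 / lik_h1)
    log_prior_odds = math.log(prior_h0 / prior_h1)
    return log_likelihood_ratio + log_prior_odds

# Meteor-strike style observation: equally improbable under both hypotheses,
# so the likelihood ratio is 1 and the odds do not move from the prior.
uninformative = log_posterior_odds(lik_h0=1e-9, lik_h1=1e-9, prior_h0=0.5)

# Observation that is improbable under H0 but fairly probable under H1.
informative = log_posterior_odds(lik_h0=0.01, lik_h1=0.5, prior_h0=0.5)

print(uninformative, math.exp(informative))
```

In the first case the log odds stay at the prior value of zero however small the probabilities are. In the second the posterior odds become 0.02 to 1 against H0, which is the sense in which the ratio, rather than the low probability alone, carries the evidence.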

4. alan says:

Yes, the use of Bayes’ Theorem can be helpful in cases where it makes sense to talk of the hypothesized probability models themselves having probabilities of being true. But sometimes that strikes me as a bit self-referential, and so possibly liable to generate Russell-type paradoxes.

And I also suspect that the resulting probability statements, even from correctly applied analysis, are just as capable as the old ones of being breathlessly expressed in ways that will mislead an uninformed public.

As you say: “The rest is just a matter of how you interpret it.”
Aye, There’s the rub!

5. Nullius in Verba says:

Bayes’ theorem itself is just an obvious statement about measures (areas, volumes, lengths, etc.). If you draw a Venn diagram where area is proportional to probability, it becomes obvious. The area of the intersection of two regions is the fraction it makes of either region times the area of that region.

So draw a Venn diagram with areas A and B, that intersect in area AiB. The area of AiB is (AiB/A) times A, and at the same time it is (AiB/B) times B. The fraction AiB/A is analogous to the probability P(B|A), since if a point is selected at random, then the probability it is in B given that it is in A is just AiB/A. Likewise for B.

So AiB = (AiB/A)*A = (AiB/B)*B means (AiB/A) = (AiB/B)*B/A which is the same thing as P(B|A) = P(A|B)P(B)/P(A).

Bayes’ theorem is obvious and trivial, when looked at in the right way. That you can apply it to more complicated stuff like compound probability models (the probability of a probability model being valid) is a separate matter altogether. That’s not generally self-referential as such (unless you think defining integers as the successors to smaller integers is self-referential) but that doesn’t matter. I don’t need that part for where I’m going.

Still, I’m not worried that there are a lot of people who don’t agree. It’s a counter-intuitive topic.
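The area argument in this comment can be checked on a toy example. The sketch below assumes twelve equally likely points, with A the even numbers and B the multiples of 3; these sets are arbitrary choices of mine, and "area" is just the fraction of points in a region.

```python
from fractions import Fraction

outcomes = range(1, 13)                    # twelve equally likely points
A = {n for n in outcomes if n % 2 == 0}    # region A: even numbers
B = {n for n in outcomes if n % 3 == 0}    # region B: multiples of 3
AiB = A & B                                # the intersection of A and B

def area(region) -> Fraction:
    """Area of a region as an exact fraction of the whole sample space."""
    return Fraction(len(region), len(outcomes))

p_b_given_a = area(AiB) / area(A)          # AiB/A, i.e. P(B|A)
p_a_given_b = area(AiB) / area(B)          # AiB/B, i.e. P(A|B)
bayes_rhs = p_a_given_b * area(B) / area(A)

print(p_b_given_a, bayes_rhs)
```

Both quantities come out as 1/3, matching P(B|A) = P(A|B)P(B)/P(A).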

6. alan says:

Yes, Bayes’ theorem itself is proved in any beginning probability course. But, as you say, “That you can apply it to more complicated stuff like compound probability models (the probability of a probability model being valid) is a separate matter altogether.” You also say you “don’t need that part” for where you are going, but it does appear to be used when you interpret (P(Obs|H0)/P(Obs|H1)) as the ratio of P(H0|obs)/P(H1|obs) and (P(H0)/P(H1)). I am not denying that there may be situations in which such an interpretation makes perfectly good sense but I suspect that there are restrictions on that interpretation which are not universally well understood. You may understand them, but not all possible users of the idea will – and this may well lead to abuses of the idea to create an unwarranted sense of certainty about claimed results. (Kind of like the situation with p-values, but I think actually more so.)

7. Nullius in Verba says:

Ah, you mean for the probability of a *hypothesis* being valid. Yes, you do need that bit.

Although if you reject the idea that you can generally assign probabilities to hypotheses, I’m not sure what you expect to achieve by doing experiments.

There are several issues you could be thinking of. There’s the problem of induction (whether we can gain universal knowledge by doing experiments on parts of the world) and the difference between ‘Bayesian probability’ and ‘Bayesian belief’, which I’ve expounded on in the past at Briggs’.

Yes, there are philosophical difficulties, but they’re mostly of the sort you can ignore for everyday use. You were asking for a way of looking at it to show how and why p-values are useful. This one works.

If you’re worried it doesn’t, then perhaps it would be best to do as Briggs suggests and avoid using p-values in future?

8. alan says:

I’m happy to keep on using p-values for what I think they are (as outlined in my next post) – namely a measure of the probability of making a wrong rejection if the hypothesis is true. This avoids the problematic issue of whether it makes sense to talk of the probability of a hypothesis but does not prevent me using the p-value to adjust my own feeling of confidence in that hypothesis. I don’t attempt to quantify my levels of confidence but perhaps what Bayesians call “probability” may be a way of doing that. I will see what I can find out about your distinction between ‘Bayesian probability’ and ‘Bayesian belief’ (and any links you can provide on that would be appreciated).

9. SteveBrooklineMA says:

Suppose we have disjoint events (e1,e2,…,en), and corresponding probabilities for these events (p1,p2,…,pn), where pk=Prob(ek) and p1+p2+…+pn=1. If a particular event ek occurs, what would make us think this was in some sense “unusual” or perhaps “suspicious”? As you point out, it’s not enough that pk be small, since for large n, even a uniform distribution on the en will have pk small. Nor is it enough that pk be much less than max_n pn, since it is possible that all pn are equal except for one event having many times larger yet still tiny probability. It’s not enough if pk is less than nearly all the other pn, because all the pn could be very nearly equal.

What does seem to work in the cases I can think of is to choose some factor R greater than 1, calculate sum({pn: pn>R*pk}), and see if this is close to 1.

To work this into a hypothesis test, we could reject H0 if sum({pn: pn>R*pk})>(1-1/R), though the expression on the right-hand side is rather arbitrary. With this setup, what value should R be? Consider the standard normal and the p=.05 rule, Prob(|sample|>1.96)=0.05. Then R=3.71, since 3.71*NrmlPdf(1.96) = NrmlPdf(1.1), and Prob(|sample|>1.1)=1/3.71. If we wanted R=20, we would need to use a cutoff |sample|>3.135749, which corresponds to a very small standard p-value of 0.001714.

Clearly, given any p cutoff we can find a corresponding factor R, and vice-versa. Since the p=.05 rule is arbitrary, I don’t see what difference it makes for the most common cases. Thus, p-value analysis seems generally ok to me in practice.
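The correspondence between a cutoff and the factor R described in this comment can be reproduced with a short sketch. It assumes the standard normal model and reads the rejection boundary as the point t where NrmlPdf(t) = R*NrmlPdf(cutoff) and Prob(|sample|>t) = 1/R hold together. The function names and the use of bisection are my own choices, not part of the comment.

```python
import math

def pdf(x: float) -> float:
    """Standard normal density (NrmlPdf in the comment's notation)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def two_sided_tail(x: float) -> float:
    """Prob(|sample| > x) for a standard normal sample, x >= 0."""
    return math.erfc(x / math.sqrt(2.0))

def factor_for_cutoff(cutoff: float, tol: float = 1e-12):
    """Return (R, t) with pdf(t) = R*pdf(cutoff) and Prob(|sample| > t) = 1/R."""
    target = pdf(cutoff)
    lo, hi = 0.0, cutoff
    # pdf(t) * tail(t) falls steadily as t goes from 0 to cutoff, starting
    # above pdf(cutoff) and ending below it, so bisection finds the crossing.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pdf(mid) * two_sided_tail(mid) > target:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return pdf(t) / target, t

r_for_p05, t_for_p05 = factor_for_cutoff(1.96)
r_for_strict, t_for_strict = factor_for_cutoff(3.135749)
print(r_for_p05, t_for_p05)
print(r_for_strict, two_sided_tail(3.135749))
```

Under these assumptions the first call gives R close to 3.71 with t close to 1.1, and the second gives R close to 20, in line with the figures quoted in the comment.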

10. alan says:

Thanks Steve. That looks like an interesting idea.
P.S. I hope it’s ok with you that I have removed the garbled version and your reference to it – and took the liberty of changing “R greater than 11” to “R greater than 1” (which is I think what you meant)

11. That P(z|H0) is small for some set of data on z tells one nothing about any candidate H1. Worse, for any “acceptable threshold” of plausibility, c, as everyone knows, one can find a real dataset completely consistent with H0 which will have P(z|H0) smaller than c.

12. alan says:

Yes Jan, those claims are both quite true. But I don’t see them as anything “wrong with” p-values per se (though they may well underlie what is wrong with some people’s uses of p-values). If you want to see what I think p-values may be *good* for then please check out my subsequent post.