July 4th was an amazing day for science: a candidate for the realization of a field (well, a quantum fluctuation of it), known as the Higgs field, was detected: the Higgs boson. The importance of the Higgs field is enormous: it is responsible for giving mass to everything we see around us, from galaxies, stars and planets to us, atoms, etc.
If you are really interested in the physics of this mechanism, you can search the official CERN webpage for info. I’m not going to talk here about the details of the physics of the mechanism and its interpretations (which, to be honest, I don’t entirely understand), but a little about the details of the detection and a common misinterpretation of it that I’ve seen (and heard) around.
The thing is that physicists at CERN don’t actually “see” the Higgs boson; they see its decays (i.e., the transformation of the energy of the boson into other, different particles). Almost all particles decay in some form (and most of them very rapidly, like the Higgs!). Perhaps the best-known decay is that of Carbon-14, which allows us to estimate the age of different materials that contain carbon through radiocarbon dating. Like Carbon-14, the Higgs decays in various ways, but the most important ones detected at CERN were two: its decay into two photons and its decay into four leptons.
The analysis from there is “simple” (I’m obviously skipping some details, but this is just to get the big picture of the process). Protons collide at the LHC, and the resulting energy transforms into different particles. So if the Higgs is among those particles, it has to decay into other particles; let’s analyse the photon-photon decay. Of all the Higgs bosons that appear in the collisions, a number decay into two photons, where the sum of the energies of the two has a definite value: the one that corresponds to the Higgs mass (energy conservation!). So in every collision at the LHC, you take all the photons and you pair them up. For example, if you see 3 photons, you sum the energies of the first and the second and have a pair, the second and the third and have another pair, etc. The people at CMS did this and found the following plot:
Here the y-axis is the number of (weighted) events (i.e. a measure of how many pairs with a given energy you saw), and the x-axis shows the energy of the sum of the photons (a direct measure of the mass of the particle). The dotted line is the model without the Higgs boson and the red line is the model with the Higgs boson: apparently the model with the Higgs fits pretty well!
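To make the pairing step concrete, here is a minimal Python sketch of the simplified picture described above: sum the energies of every pair of photons in a collision and count how many sums land in each energy bin. The function names and the photon energies are made up for illustration, and this is not what CMS actually runs (a real analysis reconstructs the invariant mass of each pair, which also depends on the photon directions, and weights the events).

```python
from collections import Counter
from itertools import combinations


def pair_energy_sums(photon_energies):
    """Return the summed energy of every distinct photon pair in one collision."""
    return [e1 + e2 for e1, e2 in combinations(photon_energies, 2)]


def histogram(values, bin_width):
    """Count how many values fall in each bin; keys are the lower bin edges."""
    return Counter(bin_width * int(v // bin_width) for v in values)


# Made-up photon energies (in GeV) for two hypothetical collisions.
collisions = [[60.0, 66.0, 40.0], [70.0, 56.0]]

# Pool the pair sums from all collisions, then bin them (the x-axis of the plot).
sums = [s for photons in collisions for s in pair_energy_sums(photons)]
counts = histogram(sums, bin_width=2.0)

for bin_start in sorted(counts):
    print(f"{bin_start:6.1f} GeV: {counts[bin_start]} pair(s)")
```

A bump in such a histogram on top of a smooth background is what the plot above shows around the Higgs mass.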
So far so good. However, here comes the misinterpretation of the results: from all of the above, physicists calculate what is called a (local) p-value, to measure the significance of the “bump” that we’ve seen in the plot above. For anyone who hasn’t taken any statistics courses, the p-value is the probability of observing data at least as extreme as that observed, given that the null hypothesis is true. I must be clear with this definition, and to be clear, what better than an example? Imagine I have a coin and I toss it twenty times, with the following results (H: heads, T: tails):
T H H H H H H H H H H H H T H T H H H H.
I now ask you: do you think the coin is unbiased? Your answer would probably be no (there are clearly more heads than tails!). Now I ask you again… why? You might reply that, if the coin were unbiased, the results that I showed you would be very unlikely, so you’ll probably reject the hypothesis that the coin is unbiased. The p-value serves for doing this in a more “orthodox” way: you choose a null hypothesis, $H_0$ (in this case $H_0$: “the coin is unbiased”), an alternative hypothesis, $H_1$ (in this case $H_1$: “the coin is biased”), and you calculate the probability of observing data at least as extreme as that observed, given that the null hypothesis is true (in our case, the probability of observing data at least as extreme as that observed, given that the coin is unbiased). If this probability is low enough (i.e., if, given $H_0$: “the coin is unbiased”, it is unlikely to observe our data or more extreme combinations of heads and tails, which is our case), then you usually reject the null hypothesis (as we did: we rejected $H_0$: “the coin is unbiased”). If the probability is not low enough, you usually can’t say anything, so you fail to reject the null hypothesis. I’ll warn you to be very careful with these statements: rejecting a hypothesis is not the same as proving it wrong. The results that I showed could in fact come from an unbiased coin; however, among all the possible combinations of heads and tails after twenty tosses, it is very unlikely to obtain data like ours: this is what a significance test helps you decide. In the same way, failing to reject the null hypothesis is not the same as proving the null hypothesis.
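I’m going to be pretty strict about the above definition: the p-value IS NOT the probability of the coin being unbiased. Please keep this in mind (writing $H_0$ for the null hypothesis):
$$\text{p-value} = P(\text{data at least as extreme as observed} \mid H_0) \neq P(H_0 \mid \text{data})$$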
Let’s calculate the p-value for the coin example, just to be clear. Recall that the probability of obtaining $k$ successes (say, heads) in $n$ tosses of a coin, where the probability of heads and tails is the same (i.e., $p = 1/2$), is given by
$$P(X = k) = \binom{n}{k} \left(\frac{1}{2}\right)^{n}.$$
The p-value is not the same as just the probability of obtaining the 17 heads that I showed you in the example. It also includes the probability of any event at least as extreme as that, i.e., obtaining 18 heads, 19, etc. On top of that, we must multiply this by two to account for extreme values on the other side of the distribution (i.e., obtaining 3 heads, 2 heads, etc.) because those events are at least as extreme as ours too! To be pretty clear on this point, a summary of the values that I’m going to sum is given in the following plot (in red):
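The sum just described (17 or more heads, doubled to cover the other tail) is easy to check numerically. Here is a minimal Python sketch; the helper name `binomial_pmf` is mine, and the doubling shortcut assumes a fair coin under the null hypothesis and an observed count above half the tosses, as in this example.

```python
from math import comb

# The twenty tosses from the example above.
tosses = "T H H H H H H H H H H H H T H T H H H H".split()
n = len(tosses)            # number of tosses
heads = tosses.count("H")  # observed number of heads


def binomial_pmf(k, n):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2**n


# Probability of a result at least as extreme on the "many heads" side...
upper_tail = sum(binomial_pmf(k, n) for k in range(heads, n + 1))
# ...doubled to include the equally extreme "few heads" side.
p_value = 2 * upper_tail

print(f"{heads} heads out of {n}: two-sided p-value = {p_value:.4f}")
# prints "17 heads out of 20: two-sided p-value = 0.0026"
```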
In our case, then, the p-value is, mathematically:
$$p = 2 \sum_{k=17}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{20} = \frac{2 \times 1351}{2^{20}} \approx 0.0026,$$
which is, in fact, pretty low (a p-value below 0.01 is usually considered “significant at the 1% level”, i.e., at 99% confidence). In other words, it is very unlikely to obtain the data that we obtained, or more extreme values of it, given that the null hypothesis is true (given that the coin is unbiased). Again, this is not the probability of the coin being unbiased; it is just the probability of obtaining a result at least as extreme as ours given that the coin is unbiased (i.e., given the null hypothesis $H_0$). With all this in hand, let’s take a look at a plot that was released by the ATLAS team yesterday:
Here the y-axis is the (local) p-value which, from what I’ve read, is the p-value where the null hypothesis is “the signal is random background noise”, and the x-axis is the mass of the Higgs boson. I have to state that this is my interpretation of what I’ve read, so I might be wrong here. Anyway, let’s continue. There’s a pretty well-defined bump at 126 GeV, suggesting that the mass of this newly discovered particle is 126 GeV. If mine is the correct interpretation of the p-value that is shown here, I’m not entirely sure that this is the correct way of presenting the analysis. First of all, comparing p-values is dangerous: given that the null hypothesis is false, the p-value has a definite distribution which is hopefully skewed towards zero (yes, the p-value IS A RANDOM VARIABLE). On the other hand, given that the null hypothesis is true, the p-value has a uniform distribution between 0 and 1 (by the probability integral transform, for a continuous test statistic). Given all this, I really don’t know if this is a meaningful plot at all.
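The claim that the p-value is itself a random variable can be checked with a small simulation. The sketch below is a toy setup of my own, not the ATLAS analysis: it repeats a simple two-sided z-test many times, once with the null hypothesis true (mean 0) and once with it false (mean 0.5, an arbitrary choice), assuming Gaussian data with a known standard deviation of 1. All names and parameters are illustrative.

```python
import math
import random


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(true_mean, n_experiments=2000, n_samples=30, seed=0):
    """Repeat the same experiment many times and collect one p-value per run."""
    rng = random.Random(seed)
    return [
        z_test_p_value([rng.gauss(true_mean, 1.0) for _ in range(n_samples)])
        for _ in range(n_experiments)
    ]


def fraction_below(p_values, threshold):
    """Fraction of experiments whose p-value fell below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mean=0.0)         # H0 is true
alt_p = simulate_p_values(true_mean=0.5, seed=1)  # H0 is false

for t in (0.05, 0.25, 0.50):
    print(f"P(p < {t:.2f}): H0 true {fraction_below(null_p, t):.3f}, "
          f"H0 false {fraction_below(alt_p, t):.3f}")
```

With the null hypothesis true, the fraction of p-values below each threshold should come out close to the threshold itself (the uniform distribution), while with the null hypothesis false the p-values pile up near zero. In both cases every repetition gives a different p-value.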
Again, if my interpretation of the p-value that appears here is correct, I have something to say. I’ve been hearing lately (even from physicists) that the p-value that appears here is “the probability of the signal being random noise”. However, if my interpretation is correct, this is FALSE. The p-value is the probability of obtaining data at least as extreme as CERN’s given that the signal is random noise (i.e., here $H_0$: “the signal is background random noise”). Stating that this is the probability of the signal being random noise is like stating that the p-value in our coin example is the probability of the coin being unbiased. That’s nonsense.
I also read in the official ATLAS release that the p-value plot shows “the probability of background to produce a signal-like excess”. If my interpretation of the local p-value is correct, then this is also wrong: the p-value shown means that, in a universe without the Higgs, one experiment in three million would see data at least as extreme as ours (which, as we saw in our coin example, is a little more complicated than just the probability of the background producing the signal of the Higgs: in the case of Gaussian errors, the p-value would probably have been the integral of a chi-squared distribution or even the integral of a Gaussian… its value is even more complicated to interpret and, hence, to compare!). Moreover, because the p-value has a definite distribution, the reported number is just one draw from the possible values of the p-value; if we performed the experiment once again, I can guarantee you that the p-value would be different (and, depending on the distribution of the p-value given that the null hypothesis is false, it could be even closer to 0 than the one reported by the ATLAS team).
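To get a feeling for the “one in three million” figure, here is a minimal Python sketch that reads it as a one-sided Gaussian tail probability and converts it into a number of standard deviations. That reading is my assumption (the usual convention in particle physics, as far as I know), not something stated in the ATLAS text quoted above.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# "One experiment in three million", read as a one-sided tail probability.
p_value = 1 / 3_000_000

# How many standard deviations out must we go for the lower tail to hold p_value?
z = -standard_normal.inv_cdf(p_value)

# The other direction: the one-sided tail probability beyond 5 standard deviations.
tail_5_sigma = standard_normal.cdf(-5.0)

print(f"p = {p_value:.2e} corresponds to about {z:.2f} sigma")
print(f"5 sigma corresponds to p = {tail_5_sigma:.2e}, "
      f"i.e. about 1 in {1 / tail_5_sigma:,.0f}")
```

Both numbers describe how far out in the tail the data are if the null hypothesis is true. Neither is the probability that the null hypothesis is true.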
My conclusion is that significance testing wasn’t appropriate for a huge result like the one we saw, because it is confusing not only for the people who read the results, but also for physicists around the world trying to explain them. If you wanted to talk about probabilities of hypotheses or of parameters, why not use, for example, Bayesian data analysis or information-theoretic criteria for model selection?
Note (on “low enough”): How low is low enough? Well, usually a probability below 0.01 is good enough (this is called a significance level of 0.01, or 99% confidence), but it really depends on the subject of study. In fact, depending on the paradigm, this defines the probability of a Type I error, i.e., the probability of rejecting the null hypothesis when it is in fact true.
Note (on multiplying by two): This is usually called a “two-sided p-value”. This p-value has some problems in the case of non-symmetric distributions, but that’s another story.
Note (on Type I errors): This may sound strange to most people, but here’s the catch: the general idea is to always try to avoid Type I errors, i.e., rejecting the null hypothesis when it is in fact true. Think of it as a criminal trial: we want to avoid finding an innocent man guilty (in that example, $H_0$: “The man is innocent”).