A python with a long memory: fitting 1/f noise with PyMC

Introduction

Since ancient times, astronomy has been observing long-memory processes (i.e., low-frequency noise) all over the place. The problem has been a tough one for many people working on time series, especially those involved in exoplanets: the lightcurves (which show an apparent decrease in the observed flux of the star as the result of the planet eclipsing it) usually show signs of red noise that can’t be neglected, as it affects the physical properties of the planet and the star derived from these measurements (Pont, Zucker & Queloz, 2006).

A special case of these long-memory processes, flicker (or 1/f, or pink) noise, has attracted the attention of the community for more than forty years (see, e.g., Press, 1978) but, until recently, little work had been done to account for this kind of noise, despite the fact that approximate solutions to the problem have been around for almost twenty years (Wornell, 1995). However, a few years ago, Carter & Winn (2009) published a very interesting paper in which they derive and implement a methodology following the approaches given in Wornell (1995), in which a wavelet-based approach is used to transform the time series. The trick is that in wavelet space, 1/f processes have nearly diagonal covariance matrices, so the problem is extremely simplified when wavelet-transforming the time series. The implementation assumes a model of the form:

f(t,\theta) + \epsilon_{\textnormal{1/f}}(t)+\epsilon_{\textnormal{wn}}(t),

where the first term is a deterministic model with parameters \theta; the second term is a zero-mean 1/f stochastic process (defined by only one parameter, \sigma_r, which sets its amplitude but is not the square root of its variance), implying its Power Spectral Density is proportional to 1/f; and the third term is a zero-mean white noise process, here assumed to be gaussian (and hence with only one parameter defining it, \sigma_w, the square root of its variance). Furthermore, the model assumes the time sampling is uniform.

The work of Carter & Winn (2009) was implemented in a code called the Transit Analysis Package (TAP; Gazak et al., 2012), which was coded in the Interactive Data Language (IDL), a very nice coding platform/language, but also a very expensive one for a poor graduate student like myself1. So I decided to make my own Python implementation of the work of Carter & Winn (2009), using PyMC, a really amazing python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. In this blog post I want to give a little tutorial on how to use this implementation; you can download all the code for your research from my personal webpage [here]. If you use it, please cite the work of Carter & Winn (2009) and acknowledge the hard work I have put into this implementation.

Installing the package

To install this package, download the code from my personal webpage and unpack it to a folder. Inside you will find an install.py file that you have to run; this will create a file called FWT.so inside the WaveletCode folder. This is a Discrete Wavelet Transform and Inverse Wavelet Transform implementation in C that I wrote, based on the algorithm in Numerical Recipes. The coding in C is done mainly because it is fast, and it can be called from Python, as you can check in the Wavelets.py code in that same folder.

Once that is done, you are all set! From now on, I will assume you are working in the same directory as the FlickerLikelihood.py file that we will use (which is in the same folder as the install.py file). Let’s do a little experiment now!

A simple problem: fitting a line with 1/f + white gaussian noise

Let’s solve the simple problem of fitting a line of the form f(t)=at+b to data that has both additive white and 1/f noise. In the following plot, I have created 1/f noise following the method of Paul Bourke, with \beta=1 (in red), and I added white gaussian noise in order to make the problem even more challenging (black dots):

[Figure: simulated 1/f noise (red) with added white gaussian noise (black dots)]
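If you want to generate a similar synthetic dataset yourself, here is a quick sketch of the spectral (FFT-based) recipe behind this kind of noise: shape the Fourier amplitudes of white gaussian noise by 1/f^{\beta/2} and transform back. The array length, seeds and noise amplitudes below are arbitrary choices, not the ones used for the figure:

```python
import numpy as np

def flicker_noise(n, beta=1.0, seed=42):
    # Spectral method: draw white gaussian noise, rescale its Fourier
    # amplitudes by 1/f^(beta/2) and transform back. beta = 1 gives
    # 1/f (flicker/pink) noise.
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                  # avoid dividing by zero at DC
    spectrum *= freqs ** (-beta / 2.0)
    noise = np.fft.irfft(spectrum, n)
    return noise / np.std(noise)         # normalize to unit variance

pink = flicker_noise(1024)                               # the 1/f component
white = np.random.default_rng(1).normal(0.0, 1.0, 1024)  # added white noise
data = pink + white
```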

The next step is to add a model to that noise. I added a model of the form f(t)=at+b, with a=0.5 and b=3, which is plotted in the following figure (you can download the dataset from here to run the code shown here):

[Figure: the simulated dataset, the line f(t)=at+b plus 1/f and white gaussian noise]

And now comes the code. The FlickerLikelihood.py file contained in the package calculates the likelihood derived by Carter & Winn (2009) through the function get_likelihood, which takes as inputs (1) the residuals, (2) the standard deviation of the white noise, \sigma_w, and (3) the parameter that defines the amplitude of the flicker-noise model, \sigma_r. With it, one can create a likelihood for our model with PyMC; I called this file LineFlickerModel.py:

import sys
sys.path.append("WaveletCode")
import FlickerLikelihood
import numpy as np
from pymc import *
import Wavelets

#  Functions  #
def model(t,a,b):
    return a*t+b

# Get the data #

t,data = np.loadtxt('flicker_dataset.dat',unpack=True)

# Priors #

b = Uniform('b',-10,10)
a = Uniform('a',-1,1)
sigma_w = Uniform('sigma_w',0,100)
sigma_r = Uniform('sigma_r',0,100)

# Likelihood #
@observed
def likelihood(value=data, t=t, a=a, b=b, sigma_w=sigma_w, sigma_r=sigma_r):
    """Log-likelihood of the observed signal under the 1/f + white noise model"""
    residuals = value - model(t, a, b)
    return FlickerLikelihood.get_likelihood(residuals, sigma_w, sigma_r)

With this code, I can now run an MCMC chain using PyMC; the following script imports the above model and draws samples from it:

import numpy as np
import matplotlib.pyplot as plt
import pymc
import LineFlickerModel

# Set the number of iterations and burnins:
niterations = int(1e5)
nburn = int(1e4)

# Set MAP estimates as starting point of MCMC. First, calculate MAP estimates:
M = pymc.MAP(LineFlickerModel)
M.fit()
# Now set this as starting point for the MCMC:
mc=pymc.MCMC(M.variables)

# And start the sample!
mc.sample(iter=niterations+nburn,burn=nburn)

# Plot the final samples (posterior samples):
pymc.Matplot.plot(mc)
plt.show()

The results you will get will be the posterior samples for each parameter. In particular, for the fitted parameters I get:

a=0.501 \pm 0.008

b=3.04 \pm 2.8

(Here I quote the standard deviations of the posterior samples as the errors.) Note the huge errorbar on b: this is good! It is good because, due to the correlated noise, we are genuinely very uncertain about its value (recall that 1/f noise + white gaussian noise is a stochastic process and, as such, we only observe one realization of it; we can take this into account, but doing so won’t make our measurements more precise, only more accurate). What if we try fitting a white gaussian noise model (as most people would do!), not taking into account that we have correlated noise? I will leave that problem to the reader, but the MCMC samples of both the gaussian white noise run (black points) and the 1/f + gaussian white noise run (red points) are plotted in the following figure:

[Figure: posterior samples from the gaussian white noise run (black points) and the 1/f + gaussian white noise run (red points)]

The results I get for the parameters are:

a_{\textnormal{g}}=0.493 \pm 0.002

b_{\textnormal{g}}=4.56 \pm 0.63

Note the very small errorbar on the estimated parameter b_{\textnormal{g}}! This is BAD! What this shows (and you can generate more datasets and verify this claim, which is actually something Carter & Winn (2009) already did) is that, in general, the errorbars on the parameters will be more realistic if you take into account the proper noise model: on average you will get the same results, but for a particular dataset the errorbars obtained with the 1/f noise model will be more realistic than the ones you would get by assuming a gaussian (or any) white noise model. For the statisticians in the crowd: the estimator that takes the 1/f noise into account has a lower variance than the one assuming white gaussian noise.
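For readers who want to try the white-noise-only comparison themselves: in LineFlickerModel.py you would just swap the flicker likelihood for a plain gaussian one. A minimal sketch of that log-likelihood (numpy only; this is the expression that would replace the FlickerLikelihood.get_likelihood call):

```python
import numpy as np

def gaussian_loglike(residuals, sigma_w):
    # Log-likelihood of i.i.d. zero-mean gaussian residuals with standard
    # deviation sigma_w: no correlated-noise (sigma_r) term at all.
    n = len(residuals)
    return (-0.5 * n * np.log(2.0 * np.pi * sigma_w**2)
            - 0.5 * np.sum(residuals**2) / sigma_w**2)
```

In the @observed function you would then return gaussian_loglike(residuals, sigma_w) and drop the sigma_r prior.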

The challenge in real life is, of course, knowing a priori what kind of noise you actually have. There are several tools for this that are beyond the scope of this blog post but, if you are really interested, you can check some of them in our work on this subject, Jordán, Espinoza et al. (2013); the paper is a little technical in general, but the section relevant to this problem is Section 4.2.

Conclusions

I want to make the conclusions super short (because this was intended only as a tutorial on how to use my code), but in summary what I want to say is:

  1. Know your noise (or at least try to understand it).
  2. This is the best advice I can give to anyone around the world: small errorbars don’t imply high accuracy.

This is it! If you have any questions, remarks, thanks, or find any bugs, the comment section is open!

1: To be fair, the real reasons why I wanted to make a Python implementation were that: (1) I stopped coding in IDL as soon as I learned Python existed, (2) most of my colleagues work in Python, and (3) I really wanted to make use of PyMC for this kind of analysis.

Defining which process goes to which core

Lately I’ve been working on really computationally expensive processes which need to be run on different cores in order to obtain the results on a human-publishable timescale (~1 week). So far I’ve just been running the codes in a terminal, but today I discovered that you can actually choose which processor each code goes to with a simple command!

Let’s suppose you have a process in a python code, say, myprocess.py, and want to run it on core X. Then, in a terminal, you do:

taskset -c X python myprocess.py
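As an aside, if you’d rather pin the process from inside the Python code itself instead of wrapping it with taskset, Linux exposes the same functionality through os.sched_setaffinity; a minimal sketch (pinning to core 0 here, but any set of cores works):

```python
import os

# Restrict the current process (pid 0 means "this process") to core 0;
# equivalent to launching the script with `taskset -c 0`. Linux-only.
os.sched_setaffinity(0, {0})

# Check which cores the process is now allowed to run on:
print(os.sched_getaffinity(0))  # → {0}
```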

Now, the trick is to know which of the cores is busy with something, in order to run the process on another core. I’ve just installed the sysstat package (apt-get install sysstat) which, once installed, enables you to run the mpstat command, which is very helpful. You run it by doing

mpstat -P ALL 1

And this will give you something like this every second (change the “1” after ALL to “2” if you want to obtain results every 2 seconds, etc.):

01:25:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
01:25:28 PM  all   21.09    0.00    0.05    0.00    0.00    0.00    0.00    0.00   78.86
01:25:28 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:25:28 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:25:28 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:25:28 PM    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:25:28 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:25:28 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM   10    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
01:25:28 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:25:28 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

To check what everything means, go to the mpstat description but, basically, the number below %usr tells you which of the cores is being used and which is not (be aware that this shows both virtual and physical cores). Currently, I’m only using CPUs 3, 4, 5 and 6, which have python codes running on them.

Let’s awaken core number 1 so you can see the difference with mpstat (note that I have two terminal windows open: one with mpstat running and one which will run the python code that I’m going to execute on core number 1). I have the following code, named test.py:

import numpy as np

x = 0.0
n = 1000000
for i in range(n):
    x = x + np.double(i)
print(x / np.double(n))

which just computes the mean of the first n integers (0 through n-1). Then, I run it on core 1 by doing:

taskset -c 1 python test.py

And the output of mpstat is:
01:26:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
01:26:28 PM  all   21.09    0.00    0.05    0.00    0.00    0.00    0.00    0.00   78.86
01:26:28 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:26:28 PM    1   41.12    0.00    0.00    0.00    0.00    0.00    0.00    0.00   58.88
01:26:28 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:26:28 PM    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:26:28 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
01:26:28 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM   10    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
01:26:28 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:26:28 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
See the change in %usr on core 1 from 0.00 to 41.12? Core 1 has been awakened!

Higgs boson “discovery” and how to NOT interpret p-values


July 4th was an amazing day for science: a candidate for the realization of a field (well, of a quantum fluctuation of it), known as the Higgs field, was detected: the Higgs boson. The importance of the Higgs field is enormous: it is responsible for giving mass to everything we see around us, from galaxies, stars and planets to us and atoms!

If you are really interested in the physics of this mechanism, you can search around the CERN official webpage for info. I’m not going to talk here about the details of the physics of the mechanism and its interpretations (which, to be honest, I don’t entirely understand), but a little about the details of the detection and a common misinterpretation of it that I’ve seen (and heard) around.

The thing is that physicists at CERN don’t actually “see” the Higgs boson; they see its decays (i.e., the transformation of the energy of the boson into other, different particles). Almost all particles decay in some form (and most of them very rapidly, like the Higgs!). Perhaps the most familiar decay is that of Carbon-14, which allows us to estimate the age of different materials that contain carbon through radiocarbon dating. Like Carbon-14, the Higgs decays in various ways, but the two most important ones detected at CERN were its decay into two photons and its decay into four leptons.

The analysis from there is “simple” (I’m obviously skipping some details, but this is just to get a big picture of the process). Protons collide at the LHC, and the resulting energy transforms into different particles. So if the Higgs is among those particles, it has to decay into other particles; let’s analyse the photon-photon decay. Of all the Higgs bosons that appear in the collisions, a number of them decay into two photons, where the sum of the energies of the two has a definite value: the one that corresponds to the Higgs mass (energy conservation!). So in every multiple collision at the LHC, you take all the photons and pair them up. For example, if you see 3 photons, you sum the energies of the first and the second and have a pair, the second and the third and have another pair, etc. The people at CMS did this and found the following plot:

[Figure: photon-photon pair mass/energy distribution from CMS]

Here the y-axis is the number of (weighted) events (i.e. a measure of how many pairs with a given energy you saw), and the x-axis shows the energy of the sum of the photons (a direct measure of the mass of the particle). The dotted line is the model without the Higgs boson and the red line is the model with the Higgs boson: apparently the model with the Higgs fits pretty well!

So far so good. However, here comes the misinterpretation of the results: from all of the above, physicists calculate what is called a (local) p-value to measure the significance of the “bump” we see in the plot above. For anyone who hasn’t taken any statistics courses, the p-value is the probability of observing data at least as extreme as that observed, given that the null hypothesis is true. I must be clear with this definition and, to be clear, what better than an example? Imagine I have a coin and I toss it twenty times, with the following results (H: heads, T: tails):

T H H H H H H H H H H H H T H T H H H H.

I now ask you: do you think the coin is unbiased? Your answer would probably be no (there are clearly more heads than tails!). Now I ask you again: why? You might reply that, if the coin were unbiased, the results that I showed you would be very unlikely, so you’ll probably reject the hypothesis that the coin is unbiased. The p-value serves for doing this in a more “orthodox” way: you choose a null hypothesis, H_0 (in this case H_0 : “the coin is unbiased”), an alternative hypothesis, H_1 (in this case H_1 : “the coin is biased”), and you calculate the probability of observing data at least as extreme as that observed, given that the null hypothesis is true (in our case, given that the coin is unbiased). If this probability is low enough [1], then you usually reject the null hypothesis (which was our case: we rejected the hypothesis H_0 : “the coin is unbiased”). If the probability is not low enough, you usually can’t say anything, so you fail to reject the null hypothesis.

I’ll warn you to be very careful with these statements: rejecting the hypothesis is not the same as proving it wrong. The results of the coin that I showed could in fact come from an unbiased coin; however, among all the possible combinations of heads and tails after twenty tosses, it is very unlikely to obtain the data that we saw: this is what a significance test helps you decide. In the same way, failing to reject the null hypothesis is not the same as proving the null hypothesis.
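To make “data at least as extreme as that observed” concrete, here is a quick simulation sketch: toss a fair coin 20 times, many times over, and count how often we land on 17-or-more heads or on the mirror image, 3-or-fewer (the number of simulated experiments and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200,000 simulated experiments of 20 fair-coin tosses each:
heads = rng.binomial(n=20, p=0.5, size=200_000)
# Fraction of experiments at least as extreme as ours (>= 17 heads,
# or the mirror image, <= 3 heads):
extreme = np.mean((heads >= 17) | (heads <= 3))
print(extreme)  # close to the exact two-sided p-value of ~0.0026 computed below
```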

I’m going to be pretty hard on the above definition: the p-value IS NOT the probability of the coin being unbiased. Please save this in your mind:

P(H_0)\ \text{or}\ P(H_0|D) \neq P(D_e|H_0) = \text{p-value},

where,

P(.):\ \text{Probability.}\\  D_e\ \ : \ \text{Data at least as extreme as that observed}.

Let’s calculate the p-value for the coin example, just to be clear. Recall that the probability of obtaining k successes (say, heads) in n=20 tosses of a coin, where the probability of heads and tails is the same (i.e., f=0.5), is given by

p(k|n,f)=\binom{n}{k}f^k(1-f)^{n-k}.
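(For reference, this pmf is a one-liner to evaluate with the standard library; e.g., the probability of obtaining exactly the 17 heads we observed:)

```python
from math import comb

def binom_pmf(k, n=20, f=0.5):
    # Probability of exactly k heads in n tosses with P(heads) = f
    return comb(n, k) * f**k * (1 - f)**(n - k)

print(round(binom_pmf(17), 4))  # → 0.0011
```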

The p-value is not just the probability of obtaining the 17 heads that I showed you in the example. It is the probability of any event at least as extreme as that, i.e., obtaining 17, 18, 19 or 20 heads. However, we must multiply this by two to account for extreme values on the other side of the distribution [2] (i.e., obtaining 3 heads, 2 heads, etc.), because those events are at least as extreme as ours too [3]! To be pretty clear on this point, the values that I’m going to sum are shown in red in the following plot:

In our case, then, the p-value is, mathematically:

P(D_e|H_0)=2\times \sum_{k=17}^{20}p(k|n,f)=0.0026=2.6\times 10^{-3},

which is, in fact, pretty low (a p-value below 0.01 is usually considered “significant at the 99% level”), i.e., it is very unlikely to obtain the data that we obtained, or more extreme versions of it, given that the null hypothesis is true (given that the coin is unbiased). Again, this is not the probability of the coin being unbiased; it is just the probability of obtaining a result at least as extreme as ours given that the coin is unbiased (given H_0). With all this in hand, let’s take a look at a plot that was released by the ATLAS team yesterday:

Here the y-axis is the (local) p-value which, from what I’ve read, is the p-value where the null hypothesis is “the signal is random background noise”, and the x-axis is the mass of the Higgs boson. I have to state that this is my interpretation of what I’ve read, so I might be wrong here. Anyway, let’s continue. There’s a pretty well-defined bump at 126 GeV, suggesting that the mass of this newly discovered particle is 126 GeV. If mine is the correct interpretation of the p-value shown here, I’m not entirely sure this is the correct way of presenting the analysis. First of all, comparing p-values is dangerous: given that the null hypothesis is false, the p-value has a distribution which is hopefully skewed towards zero (yes, the p-value IS A RANDOM VARIABLE). On the other hand, given that the null hypothesis is true, the p-value has a uniform distribution between 0 and 1 (by the probability integral transform). Given all this, I really don’t know if this is a meaningful plot at all.
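That uniformity under a true null is easy to check empirically; here is a quick sketch (the choice of a one-sample t-test, the sample size and the seed are all arbitrary here, just to illustrate the point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulate many datasets for which the null hypothesis ("the mean is 0")
# is exactly true, and compute a p-value for each via a one-sample t-test:
pvals = np.array([stats.ttest_1samp(rng.standard_normal(50), 0.0).pvalue
                  for _ in range(2000)])
# Under a true null the p-values are ~Uniform(0, 1): a flat histogram,
# with a mean close to 1/2.
print(pvals.mean())
```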

Again, if my interpretation of the p-value that appears here is correct, I have something to say. I’ve been hearing lately (even from physicists) that the p-value that appears here is “the probability of the signal being random noise”. However, if my interpretation is correct, this is FALSE. The p-value is the probability of obtaining data at least as extreme as CERN’s given that the signal is random noise (i.e., here H_0 : “the signal is random background noise”). Stating that this is the probability of the signal being random noise is like stating that the p-value in our coin example is the probability of the coin being unbiased. That’s nonsense.

I also read in the official ATLAS release that the p-value plot shows “the probability of background to produce a signal-like excess”. If my interpretation of the local p-value is correct, then this is also wrong: the p-value shown means that, in a universe without the Higgs, one experiment in three million would see data at least as extreme as ours (which, as we saw in our coin example, is a little more complicated than just the probability of the background producing the signal of the Higgs: in the case of gaussian errors, the p-value would probably have been the integral of a chi-squared distribution, or even the integral of a gaussian; its value is even more complicated to interpret and, hence, to compare!). Even more, because the p-value has a distribution, the reported number is just one realization of it; if we performed the experiment again, I can guarantee you that the p-value would be different (and, depending on the distribution of the p-value given that the null hypothesis is false, it could be even closer to 0 than the one reported by the ATLAS team).

My conclusion is that significance testing wasn’t appropriate for a huge result like the one we saw, because it is confusing not only for the people that read the results, but even for physicists around the world trying to explain them. If you wanted to talk about probabilities of hypotheses or of parameters, why not use, for example, bayesian data analysis or information-theoretic criteria for model selection?

[1]:  How low is low enough? Well, usually a probability below 0.01 is good enough (which is called a 99% significance level), but it really depends on the subject of study. In fact, depending on the paradigm, this threshold defines the probability of a Type I error, i.e., the probability of rejecting the null hypothesis when it is in fact true.

[2]:  This is usually called a “two-sided p-value”. This p-value has some problems in the case of non-symmetric distributions, but that’s another story.

[3]:  This may sound strange to most people, but here’s the catch: the general idea is to always try to avoid Type I errors, i.e., rejecting the null hypothesis when it is in fact true. Think of it as a criminal trial: we want to avoid finding an innocent man guilty (in that example, H_0 : “The man is innocent”).