In [8]:
# Set up the look of plots
import seaborn as sns
sns.set_style(rc={'font.family': ['sans-serif'],'axis.labelsize': 20})
sns.set_context("notebook")
In this script I will describe some work I did cluster the pulsar population using Gaussian mixture models.
Pulsars come in a range of types which we can imagine as a subpoopulations of the whole population. Two of the largest are normal radio pulsars and millisecond pulsars characterised by a much larger spin-frequency. To see an overview of all the types have a look here.
In this work I wanted to come up with a method to distinguish between these populations.
Firstly we need to get hold of the data, this comes courtesy of the ANTF pulsar catalouge. Here is a make file to download and extract the data we need:
In [9]:
%%writefile Makefile
DOWNLOADED = psrcat_pkg.tar
ATNF_DATABASE = psrcat_tar
DATA_FILE = ATNF_data_file.txt
PSRCAT_FILE_PATH = ./psrcat_tar/psrcat.db
all: $(DATA_FILE) $(ATNF_DATABASE)
.PHONY: clean
$(ATNF_DATABASE):
wget http://www.atnf.csiro.au/people/pulsar/psrcat/downloads/psrcat_pkg.tar.gz
gunzip psrcat_pkg.tar.gz
tar -xvf psrcat_pkg.tar
cd psrcat_tar && ls && . makeit
$(DATA_FILE):
export PSRCAT_FILE=${PSRCAT_FILE_PATH}; \
./psrcat_tar/psrcat -c "name F0 F1 F2 BINARY TYPE W10" -nonumber -o short > ${DATA_FILE}
clean:
rm -f ${DOWNLOADED}
Unfortunately using a makefile in the notebook is kind of buggy so it is best to run this from the command line:
$ make
This will have created a file ATNF_data_file.txt in the local directory which looks like:
In [10]:
! head ATNF_data_file.txt
For the time being, we can just import the data which we are interested in, namely the frequency 'F0'. We now import it, sanitize it and plot a histogram to motivate the rest of the work
In [11]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
%matplotlib inline
DATA_FILE = "ATNF_data_file.txt"
data = np.genfromtxt(DATA_FILE, skip_header=4, skip_footer=1, dtype=None)
F0 = np.genfromtxt(data[:, 1])
F1 = np.genfromtxt(data[:, 2])
F0 = F0[np.isfinite(F0)]
F1 = F1[np.isfinite(F0)]
F0 = F0[np.isfinite(F1)]
F1 = F1[np.isfinite(F1)]
data = np.log10(F0)
out = plt.hist(data, bins=50)
plt.xlabel("$\log_{10}(f)$")
plt.ylabel("Count")
plt.show()
As you can appreciate there are two distinct populations of pulsars: the larger regular population and the smaller millisecond pulsars which have significantly larger spin frequencies. We will now apply a Gaussian mixture model to learn about the underlying populations
With skikit-learn this is an incredibly simple problem. We simply create a model which we restrict to having two components. Then fit the model to the data, this is using the Expectation maximisation method as described in the docs.
In [12]:
from sklearn.mixture import GMM
model = GMM(2)
model.fit(data)
Out[12]:
In [13]:
vals = np.linspace(-1.5, 3, 100)
logprobs, responcibility = model.score_samples(vals)
pdf = np.exp(logprobs)
plt.hist(data, bins=50, normed=True)
plt.plot(vals, pdf, lw=3)
plt.show()
This looks very promising, we can also extract the mean and standard deviations of the two Gaussian components in the mixture:
In [15]:
from scipy.stats import norm
[muA, muB] = model.means_
[sigmaA, sigmaB] = np.sqrt(model.covars_)
[wA, wB] = model.weights_
plt.hist(data, bins=50, normed=True)
plt.plot(vals, wA*norm.pdf(vals, loc=muA, scale=sigmaA), lw=3)
plt.plot(vals, wB*norm.pdf(vals, loc=muB, scale=sigmaB), lw=3)
plt.show()
In [35]:
def SmallerThanNeigbours(y):
" Return a boolean array for the entries in x smaller than both neibours"
return np.r_[True, y[1:] < y[:-1]] & np.r_[y[:-1] < y[1:], True]
split = SmallerThanNeigbours(pdf)[1:-1]
min_val = vals[split][0]
plt.axvline(min_val)
plt.plot(vals, pdf, lw=3)
print "The min_val if F0 = {:2.2f} Hz".format(10**(min_val))
In [16]:
labels = model.predict(data)
ax = plt.subplot(111)
ax.scatter(np.log10(F0[labels==0]), np.log10(np.abs(F1[labels==0])), c="r")
ax.scatter(np.log10(F0[labels==1]), np.log10(np.abs(F1[labels==1])), c="b")
ax.set_xlabel("$\log_{10}(f)$")
ax.set_ylabel("$\log_{10}(\dot{f})$")
plt.show()
This nicely differentiates between the population of MSPs and normal radio pulsars without having to place a limit on $f$ by hand.