This notebook is an example of how to use Spark to perform a simple analysis using high energy physics data from an LHC experiment.
The exercises, figures and data are from original work developed and published by the LHCb collaboration as part of the opendata and outreach efforts (see credits below).
Prerequisites - This work is intended to be accessible to an audience with some familiarity with data analysis in Python and an interest in particle physics at the undergraduate level.
Technology - The focus of this notebook is as much on tools and techniques as it is on physics: Apache Spark is used for reading and analyzing high energy physics (HEP) data using Python with Pandas and Jupyter notebooks.
Credits:
The original text of this notebook, including all exercises, analysis, explanations and data, has been developed, authored and shared by the LHCb collaboration as part of their opendata project at:
The library for reading physics data stored using the popular ROOT format has been developed by DIANA-HEP and CMS Big Data project. See also the code repository at:
The Spark code in this notebook has been developed in the context of the CERN Hadoop and Spark service.
Contact email: Luca.Canali@cern.ch
There are a few different ways that you can run Python and Spark in a notebook environment.
The following are instructions to set up a lab environment on a low-end system (small VM or laptop).
If you have already set up Spark on a local machine or a cluster, you can just start the Jupyter notebook.
Note for CERN users: if you are using CERN SWAN service (hosted Jupyter notebooks) to run this, you can move on to the next cell.
Set up the Python environment, for example by downloading and installing Anaconda: https://www.continuum.io/downloads
Set up Spark
pip install pyspark
Start the Jupyter notebook
jupyter-notebook --ip=`hostname` --no-browser
Point your browser to the URL as prompted
In [1]:
# This starts the Spark Session
# Note: Shift-Enter can be used to run Jupyter cells
# These instructions rely on internet access to download the spark-root package from Maven Central
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("LHCb Open Data with Spark") \
.config("spark.jars.packages", "org.diana-hep:spark-root_2.11:0.1.11") \
.getOrCreate()
In [2]:
# Test that Spark SQL works
sql = spark.sql
sql("select 'Hello World!'").show()
Download the data for the exercises from the CERN opendata portal:
More info on the CERN opendata initiative at http://opendata.cern.ch/about
Note for CERN users using CERN SWAN (hosted notebooks): you don't need to download data (see next cell)
Simulation data (~2 MB) - you need this file only for the first part of the notebook: working on simulation data
http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
Measurement data (~1 GB) - you will need these files for the second part of the notebook: working on real data
http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root
http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetUp.root
Notes:
On Linux you can use wget to download the files (or use the Python sketch after these notes)
If you run Spark on a standalone system or VM, simply put the data in the local filesystem.
If you are using Spark on a cluster, you should put the data in a cluster filesystem, for example HDFS.
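If you prefer to stay in Python, here is a minimal download sketch using only the standard library (assuming Python 3; the URLs are the ones listed above and the destination directory is a placeholder to edit):

import os
import urllib.request

base_url = "http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/"
files = ["PhaseSpaceSimulation.root", "B2HHH_MagnetDown.root", "B2HHH_MagnetUp.root"]
dest_dir = "data"  # placeholder: edit to your preferred local path

os.makedirs(dest_dir, exist_ok=True)
for name in files:
    target = os.path.join(dest_dir, name)
    if not os.path.exists(target):  # skip files that are already present
        urllib.request.urlretrieve(base_url + name, target)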
In [3]:
# Edit this with the path to the data, with a trailing "/"
# see above at "get the data" for details on how to download
# CERN SWAN users can find data already in EOS
data_directory = "/eos/opendata/lhcb/AntimatterMatters2017/data/"
# Uncomment and edit the path for locally downloaded data
# data_directory = "/home/luca/misc/opendata-project/data/"
Note: the original text of this exercise in the form released by LHCb can be found at https://github.com/lhcb/opendata-project
Welcome to the first guided LHCb Open Data Portal project!
Before you start, you might find it helpful to find out more about matter antimatter asymmetries, what we hope to learn by studying them, and how we can detect them with experiments such as the LHCb experiment at CERN.
Here are some details that relate directly to this project:
In order to get started, and to check that the first code you write works correctly, it is best to start by analysing simulated data rather than real data from the LHC. The real data contains not only the type of events that you wish to analyse, known as the 'signal', but also events that can fake these, known as 'background'. The real data measurements are also limited by the resolution of the detector. The simplified simulation data provided here contains only the signal events and provides the results that would be obtained for a perfect detector.
IMPORTANT: For every code box with code already in it, like the one below, you must click in it and press Shift+Enter to run the code.
If the In [x]: to the left of a code box changes to In [*]:, the code in that box is currently running.
In [4]:
# Boilerplate
from __future__ import print_function
from __future__ import division
%pylab inline
pylab.rcParams['figure.figsize'] = (12.0, 8.0)
If you want help with the coding, there are, in addition to the example code, hints within each section and a function reference list.
This is a reference to Spark DataFrames and SQL
In [5]:
# Let us now load the simulated data
# as detailed above you should have downloaded locally the simulation data from
# http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
# This reads the data into a Spark DataFrame using the spark-root connector
sim_data_df = spark.read.format("org.dianahep.sparkroot").load(data_directory + "PhaseSpaceSimulation.root")
# This registers the Spark DataFrame as a temporary view and will allow the use of SQL, used later in the notebook
sim_data_df.createOrReplaceTempView("sim_data")
sim_data_df.cache() # it is a small dataset (~2 MB) so we can afford to cache it
sim_data_df.count()
Out[5]:
The data contains information about 'events' that were observed in the detector. An event refers to the particles produced when two protons collide at the LHC. The data you have includes information about particles observed in the detector after each collision. If you think of the data as a table, each row of the table is the result from a different collision. The columns of the table are different quantities measured about the particles produced in the collision.
We are interested in analysing the decays of particles called B+ or B- mesons decaying into three other mesons called kaons (K+ or K-). The events you have been given are those in which this process may have occurred. The detector has been used to reconstruct tracks that may have come from the kaons. You are given the measured momenta, charge, and likelihood of the tracks being kaons. You are given information for three tracks in each event, the ones that could be the three kaons that a B+ or B- meson has decayed into. The following information is available about each event: information list
In [6]:
# Display the first 10 rows in the sim_data_df Spark DataFrame
sim_data_df.limit(10).toPandas() # use pandas only for visualization
Out[6]:
In [7]:
# Print the schema of the simulation data
sim_data_df.printSchema() # the schema of the root file
In [8]:
# Plot a histogram of the distribution of the H1_PX variable, using Pandas
# This is a basic solution that moves all the data from the Spark DataFrame
# into a Python Pandas DataFrame. It's OK for small data sets, but it has scalability issues
h1px_data = sim_data_df.select("H1_PX").toPandas() # select H1_PX data and moves it to Pandas
h1px_data.plot.hist(bins=31, range=[-150000, 150000], title="Histogram - distribution of H1_PX, simulation data")
xlabel('H1_PX (MeV/c)')
ylabel('Count');
In [9]:
# This example computes and plots a histogram of the H1_PX data, similarly to the previous cell
# The notable difference is that Spark SQL is used to compute the aggregations and only the final result
# is returned and transformed into a Pandas DataFrame, just for plotting.
# This version can scale on a cluster for large datasets, while the previous version requires fetching all the data into Pandas
histogram_h1px_df = sql("""
select round(H1_PX/10000,0) * 10000 as bin, count(1) as count
from sim_data
group by round(H1_PX/10000,0) order by 1
""")
histogram_h1px_pandas = histogram_h1px_df.toPandas()
histogram_h1px_pandas.plot.bar(x='bin', y='count', title="Histogram - distribution of H1_PX, simulation data,")
xlabel('H1_PX (MeV/c)')
ylabel('Count');
In [10]:
# This is the same query used for the histogram displayed above.
# It is here just to show the numeric values of each of the bins
sql("""
select round(H1_PX/10000,0) * 10000 as bin, count(1) as count
from sim_data
group by round(H1_PX/10000,0) order by 1
""").show(50)
Momentum is a vector quantity: it has x, y, and z components. Try calculating the magnitude of the momentum of the first kaon candidate and plotting a histogram of it; you'll need the H1_PX, H1_PY and H1_PZ variables.
In [11]:
# Selects the vector components of the momentum of H1 and computes the magnitude of the vector
# Only consider data where H1_PROBK = 1.0 (note, this could be relaxed to H1_PROBK >= <some threshold value>)
p_tot = sql("""
select H1_PX, H1_PY, H1_PZ, round(sqrt(H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ),2) H1_PTOT
from sim_data
where H1_PROBK = 1.0""")
p_tot.show(5) # displays the first 5 rows of the result
In [12]:
# calculate a variable for the magnitude of the momentum of the first kaon
# plot a histogram of this variable
h1ptot_data_plot = p_tot.select("H1_PTOT").toPandas().plot.hist(bins=31, range=[0, 550000])
xlabel('H1_PTOT (MeV/c)')
ylabel('Count');
Histogram plotting - You can use the hist() function. The parameters bins=n and range=(x, y) allow you to plot n bins over the range x to y.
Vector Magnitude - The squared magnitude of a vector is given by the sum of the squares of its components in the x, y, and z directions: $p^2 = p_x^2+p_y^2+p_z^2$, where $p$ is the magnitude of the momentum, and $p_x, p_y, p_z$ are the components of the momentum in the X, Y, and Z directions.
Einstein's theory of special relativity relates Energy, mass and momentum. We have measured the momentum of the kaon candidates in the detector, and have just plotted one of the components of the momentum of the kaon, and the magnitude of the momentum. The invariant mass of the kaon is well known and you can look this up. We wish to determine the energy of the kaons.
Here is a brief guide to the energy-momentum relation of special relativity. Further information can be found on wikipedia pages on Invariant Mass and the Energy-momentum relation.
Now, calculate the energy of the first kaon candidate using the energy-momentum relation: $E^2 = p^2 + m^2$ (in units where $c = 1$).
In [13]:
# Computes the energy of the kaon candidates using the energy-momentum relation
# of special relativity, i.e. from the magnitude of the momentum and the invariant mass
kcharged_mass = 493.677
Energy_H1 = spark.sql("""
select round(sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ),2) H1_Energy
from sim_data
where H1_PROBK = 1.0
""".format(kcharged_mass*kcharged_mass))
Energy_H1.show(5)
In [14]:
# Plots a histogram of the energy of the first kaon candidate
Energy_H1_data_plot = Energy_H1.toPandas().plot.hist(bins=31, range=[0, 550000])
xlabel('H1_Energy (MeV)')
ylabel('Count');
Energy calculation - Use the magnitude of momentum variable you calculated above and the known invariant mass of the kaon to work out the energy of the first hadron. Calculate the energy squared, and then the energy and plot this.
Kaon mass - you can find the kaon mass on wikipedia or in physics textbooks. There is also a reference used by particle physicists: all our knowledge of the properties of the particles is collected together by the particle data group here.
Calculate the momenta and energies of the second and third kaon candidates also.
In [15]:
# calculate variables for the energy of the other two kaons
kcharged_mass = 493.677
Energy_H2 = spark.sql("""
select sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) H2_Energy
from sim_data
where H2_PROBK = 1.0
""".format(kcharged_mass*kcharged_mass))
Energy_H3 = spark.sql("""
select sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) H3_Energy
from sim_data
where H3_PROBK = 1.0
""".format(kcharged_mass*kcharged_mass))
Energy_H2_data_plot = Energy_H2.toPandas().plot.hist(bins=31, range=[0, 550000])
xlabel('H2_Energy (MeV)')
ylabel('Count')
Energy_H3_data_plot = Energy_H3.toPandas().plot.hist(bins=31, range=[0, 550000])
xlabel('H3_Energy (MeV)')
ylabel('Count');
In this analysis we are looking for B+ or B- mesons (see B meson) that have decayed into the three charged kaons.
Energy is a conserved quantity. This means that you can use the energy of the three 'daughter' kaons, which you have calculated above, to calculate the energy that the B meson that decayed into them must have.
Momentum is also a conserved quantity. Hence you can also use the momenta of the 'daughter' kaons to calculate the momentum of the B meson. But be careful - momentum is a vector quantity.
Using the Energy of the B meson and the magnitude of the momentum of the B meson you can use the energy-momentum relationship again. This time you are applying it to the B meson. This will allow you to calculate the invariant mass of the B meson.
In [16]:
# calculate the energy of the B meson from the sum of the energies of the kaons
sum_kaons_energy = sql("""
select
sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) +
sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) +
sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as Tot_Energy
from sim_data
where H1_ProbK = 1.0 and H2_ProbK = 1.0 and H3_ProbK = 1.0""".format(kcharged_mass*kcharged_mass))
sum_kaons_energy.show(10)
In [17]:
# Calculate the momentum components of the B meson
# This is a vector sum (i.e. we sum each vector component of the kaons)
sum_kaons_momentum = sql("""
select
H1_PX + H2_PX + H3_PX as PX_Tot,
H1_PY + H2_PY + H3_PY as PY_Tot,
H1_PZ + H2_PZ + H3_PZ as PZ_Tot
from sim_data
where H1_ProbK = 1.0 and H2_ProbK = 1.0 and H3_ProbK = 1.0""")
sum_kaons_momentum.show(10)
In [18]:
# Calculate the magnitude of the momentum of the B meson
# This computes the vector magnitude of the momentum components computed above
# We use the Spark SQL declarative (DataFrame) interface as opposed to writing an SQL statement;
# the two approaches are equivalent in Spark
sum_kaons_momentum_magnitude = sum_kaons_momentum.selectExpr("sqrt(PX_Tot*PX_Tot + PY_Tot*PY_Tot + PZ_Tot*PZ_Tot) as P_Tot")
sum_kaons_momentum_magnitude.show(10)
In [19]:
# calculate the B meson invariant mass
# plot the B meson invariant mass in a histogram
b_meson_4momentum = sql("""
select
sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) +
sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) +
sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as Tot_Energy,
H1_PX + H2_PX + H3_PX as PX_Tot,
H1_PY + H2_PY + H3_PY as PY_Tot,
H1_PZ + H2_PZ + H3_PZ as PZ_Tot
from sim_data
where H1_ProbK = 1.0 and H2_ProbK = 1.0 and H3_ProbK = 1.0""".format(kcharged_mass*kcharged_mass))
b_meson_4momentum.show(5)
In [20]:
b_meson_invariant_mass = b_meson_4momentum.selectExpr("""
sqrt(Tot_Energy* Tot_Energy - (PX_Tot*PX_Tot + PY_Tot*PY_Tot + PZ_Tot*PZ_Tot) ) as invariant_mass""")
b_meson_invariant_mass.show(5)
In [21]:
b_meson_invariant_mass.toPandas().plot.hist(bins=101, range=[4000, 6000],
title="Histogram - distribution of B meson invariant mass, simulation data")
xlabel('b_meson_invariant_mass (MeV)')
ylabel('Count');
# Note: the mass of the charged B meson is expected to be 5279.29±0.15 MeV, which is
# consistent with the peak found in the data plotted here
You should have a graph that sharply peaks at the mass of the B+ meson. The mass of the B+ and B- meson are the same. Check that the peak of your graph is at the known mass of the B meson. Congratulations!
Recall that you have made this plot for simulated data. How might you expect the plots for real data to look different? In the next section you will start to work with the real LHC data.
B Meson Energy - From energy conservation, the energy of the B meson will be the sum of the energies of the three kaons: $E_B=E_{K1}+E_{K2}+E_{K3}$, where $E_B$ is the energy of the B meson, $E_{K1}, E_{K2}, E_{K3}$ are the energies of each of the kaons.
B meson momentum - From momentum conservation, the X component of the momentum of the B meson will be the sum of the X momentum components of the three kaons: $px_B=px_{K1}+px_{K2}+px_{K3}$, where $px_B$ is the X direction component of the momentum of the B meson, and $px_{K1},px_{K2},px_{K3}$ are the X direction components of the momenta of the three kaons. You can then do the same with the Y and Z components. Having obtained the X, Y, and Z components of the B momentum you can find the magnitude of the momentum of the B meson.
B meson invariant mass - Rearrange the equation $E^2=p^2+m^2$ to find $m^2$. Using the values of the magnitude of the momentum of the B meson and the B meson energy, find the mass of the B meson.
Histogram plotting - Take care that the range of your mass plot is set suitably so that you can see the mass peak. Once you have found the peak you can set the range appropriately. You do not have to start your graph at a mass of 0.
Units - The data you are provided has energies in MeV ($10^6$ electron volts). The mass of the B meson is often quoted in GeV/$c^2$ ($10^9$ electron volts).
In the section above you have analysed the simulation data to determine the invariant mass of the B meson. Now, you can start applying the methods you have used to the real LHCb data. This data was collected by the LHCb detector at CERN during 2011, the first major year of LHC operations.
The data you are given has been filtered to select only events that are likely to have come from B+ or B- mesons decaying into three final state charged particles. You are interested in the case where these three final state particles are charged kaons K+ or K-.
An introduction has been provided on the detector and data sample. As background information we also provide further information on the selection that has been applied to select this data sample.
You want to apply a preselection to the three final state tracks that requires:
- !H1_isMuon (where ! means not), and similarly for H2 and H3
- H1_ProbPi < 0.5 (and similarly for H2 and H3)
- H1_ProbK > 0.5 (and similarly for H2 and H3)
You need to find a balance between cuts that are too loose, including too many background events, and cuts that are too tight, rejecting many of your signal events.
In order to now find the most suitable further selection cuts, make yourself familiar with how cuts can affect the significance of the final result. Feel free to come back to this stage later and adjust your cuts to see the impact.
The preselection you create will be applied for you if you give it the name 'preselection'.
We have provided an example preselection in the hints, so feel free to use that to get started if you wish. Start with a loose preselection and then refine it after you have studied the plots.
In [22]:
# Create the DataFrames with ROOT data using the spark-root data format
# Only metadata is read at this stage (reading into Spark DataFrames is lazily executed)
B2HHH_MagnetDown_df = spark.read.format("org.dianahep.sparkroot").load(data_directory + "B2HHH_MagnetDown.root")
B2HHH_MagnetUp_df = spark.read.format("org.dianahep.sparkroot").load(data_directory + "B2HHH_MagnetUp.root")
In [23]:
# Put all the data together
B2HHH_AllData_df = B2HHH_MagnetDown_df.union(B2HHH_MagnetUp_df)
In [24]:
# This defines the cut criteria
# You can experiment with different criteria
preselection = """H1_ProbPi < 0.5 and H2_ProbPi < 0.5 and H3_ProbPi < 0.5
and H1_ProbK > 0.5 and H2_ProbK > 0.5 and H3_ProbK > 0.5
and H1_isMuon = 0 and H2_isMuon = 0 and H3_isMuon = 0"""
# Apply cuts to the data as a filter
B2HHH_AllData_WithCuts_df = B2HHH_AllData_df.filter(preselection)
In [25]:
# This *may take a few minutes* as data will be read at this stage
B2HHH_AllData_WithCuts_df.cache() # flags the DataFrame for caching, this is useful for performance
B2HHH_AllData_WithCuts_df.count() # triggers an action, data will be read at this stage
Out[25]:
In [26]:
# This registers the dataframe with cuts (filters) as a view for later use with SQL
B2HHH_AllData_WithCuts_df.createOrReplaceTempView("B2HHH_AllData_WithCuts")
In [27]:
# Displays a sample of the data
B2HHH_AllData_WithCuts_df.limit(10).toPandas()
Out[27]:
Make histograms of the probability of a final state particle being a kaon or a pion. These will help guide you on suitable probability values at which to cut.
You can also consider more sophisticated options like 2-D plots of kaon and pion probabilities or different values of the cuts for the different final state particles.
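As a sketch of the 2-D option, one could draw a hexbin density plot of the kaon probability versus the pion probability for the first track (here on the DataFrame with cuts already applied; starting from B2HHH_AllData_df instead would show the full probability range):

# 2-D density of kaon vs pion probability for H1; a log color scale
# makes sparsely populated bins visible
probs = B2HHH_AllData_WithCuts_df.select("H1_ProbPi", "H1_ProbK").toPandas()
probs.plot.hexbin(x='H1_ProbPi', y='H1_ProbK', gridsize=50, bins='log')
xlabel('Probability that H1 is a pion')
ylabel('Probability that H1 is a kaon');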
In [28]:
# plot the probability that a final state particle is a kaon
B2HHH_AllData_WithCuts_df.select("H1_ProbK", "H2_ProbK", "H3_ProbK").toPandas().plot.hist(bins=101, range=[0.0, 1.0])
xlabel('Probability value that the particle is a kaon')
ylabel('Count');
In [29]:
# plot the probability that a final state particle is a pion
B2HHH_AllData_WithCuts_df.select("H1_ProbPi", "H2_ProbPi", "H3_ProbPi").toPandas().plot.hist(bins=101, range=[0.0, 1.0])
xlabel('Probability value that the particle is a pion')
ylabel('Count');
Now calculate the invariant mass of the B meson for the real data and plot a histogram of this. Compare it with the one you drew for the simulation data.
Can you explain the differences you observe?
In [30]:
# calculate the B meson invariant mass
# plot the B meson invariant mass in a histogram
b_meson_4momentum_withcuts = sql("""
select
sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) +
sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) +
sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as Tot_Energy,
H1_PX + H2_PX + H3_PX as PX_Tot,
H1_PY + H2_PY + H3_PY as PY_Tot,
H1_PZ + H2_PZ + H3_PZ as PZ_Tot
from B2HHH_AllData_WithCuts""".format(kcharged_mass*kcharged_mass))
b_meson_4momentum_withcuts.createOrReplaceTempView("b_meson_4momentum_mycuts_read_data")
b_meson_4momentum_withcuts.show(5)
In [31]:
b_meson_invariant_mass_withcuts = b_meson_4momentum_withcuts.selectExpr("""
sqrt(Tot_Energy* Tot_Energy - (PX_Tot*PX_Tot + PY_Tot*PY_Tot + PZ_Tot*PZ_Tot) ) as invariant_mass""")
b_meson_invariant_mass_withcuts.show(5)
In [32]:
# draw a histogram for the B meson mass in the real data
b_meson_invariant_mass_withcuts.toPandas().plot.hist(bins=101, range=[4000, 6000],
title="Histogram - distribution of B meson invariant mass, real data")
xlabel('b_meson_invariant_mass (MeV)')
ylabel('Count');
Additional exercise: Experiment with the cuts and see the impact of harsher or more lenient cuts on the invariant mass plot. You should select a set of cuts which makes the signal most prominent with respect to the background. Once you have finalised the selection on particle identification also make cuts on the reconstructed particle mass to select the events in the B meson mass peak, removing the background events which lie at lower and higher invariant masses.
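As a minimal sketch of that last step, one could filter the invariant-mass DataFrame computed above; the 5100 to 5500 MeV window is an illustrative assumption, to be tuned on your own histogram:

# keep only candidates near the B mass peak; the window edges are
# illustrative and should be adjusted by inspecting the plot above
b_meson_peak_df = b_meson_invariant_mass_withcuts.filter(
    "invariant_mass between 5100 and 5500")
b_meson_peak_df.count()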
In this section you will start to study matter antimatter differences (CP violation). Here 'global' means that you are looking for differences across all ranges of energy and momentum (the kinematics) of the kaons into which the charged B mesons have decayed. Later we look at 'local' differences in different regions of the kinematics.
In order to quantify the matter antimatter asymmetry in this process we wish to compare the B+ and the B- particles. The B- is the anti-particle of the B+.
How can you distinguish between events that contain B+ and B- particles using H1_Charge, H2_Charge and H3_Charge?
In [33]:
# make a variable for the charge of the B mesons
B_charge_df = B2HHH_AllData_WithCuts_df.selectExpr("H1_charge + H2_charge + H3_charge as B_Charge")
Now count the numbers of events of each of the two types (N+ and N-). Also calculate the difference between these two numbers.
In [34]:
# make variables for the numbers of positive and negative B mesons
# I am using the declarative API of Spark SQL for this. If you want to use SQL you can do the following:
# B_charge_df.createOrReplaceTempView("B_charge_table")
# sql("select B_Charge, count(*) from B_charge_table group by B_Charge").show(5)
B_charge_df.groupBy("B_Charge").count().show()
In order to calculate the asymmetry, you can make use of the formula: $A = \frac{N^- - N^+}{N^- + N^+}$, where $N^+$ and $N^-$ are the numbers of events containing B+ and B- mesons respectively.
In [35]:
# calculate the value of the asymmetry, by using the formula above, and then print it
N_plus = 12390.0
N_minus = 11505.0
A = (N_minus - N_plus) / (N_minus + N_plus)
A
Out[35]:
Differentiating between N+ and N- - You can use len(real_data.query('B_Charge == charge')) to count the number of mesons, where B_Charge is the variable you created and charge is 1 or -1.
You will now need to calculate the statistical uncertainty of the asymmetry. You can do so using the formula: $\sigma_A = \sqrt{\frac{1 - A^2}{N^+ + N^-}}$
The significance of the result, sigma, is found by dividing the value for asymmetry by its uncertainty. A value exceeding three sigma is considered "evidence" by particle physicists while a value of five sigma or more can be called an "observation" or "discovery".
In [36]:
# calculate the statistical uncertainty of the asymmetry, then the significance
# of the result, i.e. the asymmetry divided by its uncertainty
sigma_A = sqrt((1 - A*A)/(N_minus + N_plus))
A / sigma_A
Out[36]:
Congratulations! You have performed your first search for a matter anti-matter difference.
Here you have only considered the statistical uncertainty. Your measurement will also have other sources of uncertainty known as systematic uncertainties which you have not considered at this stage.
In this stage we introduce you to an important technique for analysing decays of one particle (your charged B meson) into three bodies (the three kaons). This is known as a Dalitz plot.
The decay of the B meson can proceed either directly to the three-body final state or via an intermediate particle. For example, B+ → K+K+K−, could proceed through the decay B+ → K+R0, where R0 is a neutral particle resonance which can decay R0 → K+K-. Dalitz plots can be used to identify these resonances which are visible as bands on the Dalitz plot.
More information about these plots and why these are used in particle physics research can be found in Dalitz Plot Introduction.
The kinematics of a three-body decay can be fully described using only two variables. The energies and momenta of the three kaons are not independent of each other as they all come from the decay of a B meson and energy and momentum are conserved. The axes of the plots conventionally are the squared invariant masses of two pairs of the decay products. It is a 2D plot, the x and y axes are both squared masses and the density of points in the plot shows the structure.
Consider our decay $B^+ \to K_1^+ K_2^+ K_3^-$, where we have numbered the kaons 1, 2, 3 to distinguish them. We can calculate the invariant mass of three possible combinations that could correspond to intermediate resonances: $R_1^{++} \to K_1^+ K_2^+$, $R_2^0 \to K_1^+ K_3^-$, and $R_3^0 \to K_2^+ K_3^-$.
The potential $R_1^{++}$ would be a doubly charged resonance. We would not expect to see any resonances corresponding to this, as mesons are composed of one quark and one anti-quark and their charges cannot add up to two units.
The potential $R_2^0$ and $R_3^0$ correspond to configurations in which we could see resonances. Hence you should compute the invariant mass combinations for these. The squares of these masses should be used as the Dalitz variables.
We suggest you make these plots first for the simulation data. In the simulation there are no intermediate resonances and your plot should be of uniform density inside the range physically allowed by energy and momentum conservation.
In [37]:
# calculate the invariant masses for each possible hadron pair combination
two_body_resonances_df = sql("""
select
sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) +
sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) as Energy_K1_K2,
sqrt((H1_PX + H2_PX)*(H1_PX + H2_PX) + (H1_PY + H2_PY)*(H1_PY + H2_PY)
+ (H1_PZ + H2_PZ)*(H1_PZ + H2_PZ)) as P_K1_K2,
sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) +
sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as Energy_K1_K3,
sqrt((H1_PX + H3_PX)*(H1_PX + H3_PX) + (H1_PY + H3_PY)*(H1_PY + H3_PY)
+ (H1_PZ + H3_PZ)*(H1_PZ + H3_PZ)) as P_K1_K3,
sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) +
sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as Energy_K2_K3,
sqrt((H2_PX + H3_PX)*(H2_PX + H3_PX) + (H2_PY + H3_PY)*(H2_PY + H3_PY)
+ (H2_PZ + H3_PZ)*(H2_PZ + H3_PZ)) as P_K2_K3,
H1_Charge, H2_Charge, H3_Charge
from B2HHH_AllData_WithCuts""".format(kcharged_mass*kcharged_mass))
two_body_resonances_df.limit(5).toPandas()
Out[37]:
In [38]:
# Computes 2-body resonance invariant mass from two_body_resonances_df
two_body_resonances_invariant_mass_GeV_df = two_body_resonances_df.selectExpr(
"sqrt(Energy_K1_K2*Energy_K1_K2 - P_K1_K2*P_K1_K2) / 1000.0 as Mass_K1_K2",
"sqrt(Energy_K1_K3*Energy_K1_K3 - P_K1_K3*P_K1_K3) / 1000.0 as Mass_K1_K3",
"sqrt(Energy_K2_K3*Energy_K2_K3 - P_K2_K3*P_K2_K3) / 1000.0 as Mass_K2_K3",
"H1_Charge", "H2_Charge", "H3_Charge")
two_body_resonances_invariant_mass_GeV_df.show(5)
In [39]:
# two_body_resonances_invariant_mass_GeV_df.filter("H1_Charge * H2_Charge = -1").show()
two_body_resonances_invariant_mass_GeV_df.createOrReplaceTempView("t1")
sql("select H2_Charge * H3_Charge, count(*) from t1 group by H2_Charge * H3_Charge").show()
In [40]:
# plot the invariant mass for one of these combinations
two_body_resonances_invariant_mass_GeV_df.filter("H1_Charge + H2_Charge = 0").select("Mass_K1_K2") \
.toPandas().plot.hist(bins=101, range=[0, 10],
title="Histogram - distribution of K1_K2 resonance invariant mass, simulation data")
xlabel('b_meson_invariant_mass (GeV)')
ylabel('Count');
In [41]:
two_body_resonances_invariant_mass_GeV_df.filter("H1_Charge * H3_Charge = -1").count()
Out[41]:
In [42]:
# Make a Dalitz plot with labelled axes for the real data (with cuts applied)
# This is achieved by plotting a scatter graph from Mass_12 squared vs. Mass_13 squared
# As in the text of the exercise, add a filter requiring that the possible resonance has charge 0
dalitz_plot_df = two_body_resonances_invariant_mass_GeV_df \
.filter("H1_Charge + H2_Charge = 0").filter("H1_Charge + H3_Charge = 0") \
.selectExpr("Mass_K1_K2*Mass_K1_K2 as M12_squared", "Mass_K1_K3*Mass_K1_K3 as M13_squared")
dalitz_plot_df.toPandas().plot.scatter(x='M12_squared', y='M13_squared',
title="Dalitz plot, B+/- meson decay into thre kaons, simulation data")
xlabel('Mass_K1_K2 squared (GeV^2)')
ylabel('Mass_K1_K3 squared (GeV^2)');
Calculating invariant mass - Use the same technique as you did above for the B meson, but now applying it to two-body invariant masses rather than three.
Plotting the Dalitz plot - You can use a scatter plot from matplotlib to plot a Dalitz plot; see the example analysis. Remember to use the square of each two-body mass.
In [43]:
# calculate the invariant masses for each possible hadron pair combination in the real data
In [44]:
# make a Dalitz plot for the real data (with your preselection cuts applied)
You can make a further improvement to allow you to observe the resonances more easily. Your resonances $R_2^0$ and $R_3^0$ are both composed of the same particle types, $K^+K^-$, and hence have the same distributions. It is useful to impose an ordering which distinguishes the resonances. We can call the resonances $R^0_{Low}$ and $R^0_{High}$. In each event $R^0_{Low}$ is the resonance with the lower mass and $R^0_{High}$ is the higher-mass combination of kaons. You can now use the masses of these ordered resonances as your Dalitz plot variables, thus effectively "folding" your Dalitz plot so that one axis always has a higher value than the other.
In [45]:
# make a new Dalitz plot with a mass ordering of the axes
Ordered Dalitz plot - Put the higher of the two masses ($R^0_{High}$) on one axis and the lower ($R^0_{Low}$) on the other, elementwise. You can use numpy.minimum(a, b) and numpy.maximum(a, b) to perform elementwise comparisons between two arrays a and b; each returns an array filled with the elementwise minimum (or maximum) of the two.
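A minimal sketch of the folding, reusing the dalitz_plot_df DataFrame from the cell above:

import numpy as np

dalitz_pdf = dalitz_plot_df.toPandas()
# order the two squared masses elementwise, event by event
m2_low = np.minimum(dalitz_pdf['M12_squared'], dalitz_pdf['M13_squared'])
m2_high = np.maximum(dalitz_pdf['M12_squared'], dalitz_pdf['M13_squared'])

scatter(m2_low, m2_high, s=2)
xlabel('M^2 low (GeV^2)')
ylabel('M^2 high (GeV^2)')
title('Folded Dalitz plot');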
In [46]:
# plot a binned Dalitz Plot
# use colorbar() to make a legend for your plot at the side
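A sketch of the binned version, using matplotlib's hist2d on the folded variables m2_low and m2_high from the previous sketch:

# binned (2-D histogram) Dalitz plot; colorbar() provides the legend
hist2d(m2_low, m2_high, bins=25)
colorbar(label='Candidates per bin')
xlabel('M^2 low (GeV^2)')
ylabel('M^2 high (GeV^2)');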
You can now use your Dalitz plot to identify the intermediate resonances that you see in your plots. The resonances will have shown up as bands of higher density of points on the plots. You can use the particle data group tables of mesons to identify which particles these correspond to. The tables give the masses and widths of the particles and their decay modes. You are looking for mesons with the masses corresponding to where you see the bands and that decay into K+K-.
Congratulations! You have successfully made a Dalitz plot and used it to observe the presence of intermediate particles in the decay of your charged B meson into three charged kaons.
In a section above you searched for global CP violation. You probably did not find a result with very high significance.
CP violation may arise from interference between decays through different resonances, and hence the magnitude and sign of the CP violation may vary across the Dalitz plot. We can apply the same equation as in the global CP violation study
The analysis performed here is to study the CP violation in the charmless B meson decays to kaons. "Charmless" means that the decay does not proceed through a charm quark. However, the most frequent decays of the B mesons occur through the b quark decaying into a c quark. The majority of these events can be removed by rejecting the events that proceed through a D0 meson (which contains the charm quark).
In the section above you plotted a histogram of the invariant mass of the intermediate resonances and will have observed the D0 meson in this and in the Dalitz plot. You should now reject events that are around the mass range of the D0 meson to suppress this contribution. You can do this in the pre-selection on the data that you set up earlier in the project.
This was also a simplification that we did not consider when we were calculating the global asymmetry. After you have applied this pre-selection your code will now recompute the global asymmetry with the D0 meson contribution rejected. We have not yet observed CP violation in charm mesons and searching for this is another active area of current research.
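As a sketch of such a veto, applied here to the two-body masses computed earlier (equivalently it could be folded into the preselection); the 1.80 to 1.94 GeV window around the D0 mass of about 1.865 GeV is an illustrative assumption:

# reject candidates where either neutral two-body combination has a mass
# close to the D0 mass; the window edges are illustrative and should be tuned
d0_rejected_df = two_body_resonances_invariant_mass_GeV_df.filter(
    """not (Mass_K1_K3 between 1.80 and 1.94)
       and not (Mass_K2_K3 between 1.80 and 1.94)""")
d0_rejected_df.count()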
Make separate Dalitz plots for the B+ and the B- decays. Local CP Violation will show up as an asymmetry between the B+ and the B- plots.
In order that the statistical error on the asymmetry in each bin is not over large the bins need to contain a reasonable number of entries. Hence you will probably need larger bins than when you were looking for resonances in the section above. A suitable initial bin size might be $2.5~\text{GeV}^2/\text{c}^4 \times 2.5~\text{GeV}^2/\text{c}^4$.
In [47]:
# make a Dalitz plot for the B+ events
In [48]:
# make a Dalitz plot for the B- events
In [49]:
# Make a plot showing the asymmetry between these two Dalitz plots
# i.e. calculate the asymmetry between each bin of the B+ and B- Dalitz plots and show the result in another 2D plot
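One possible sketch for the asymmetry plot, assuming the two_body_resonances_invariant_mass_GeV_df DataFrame from above. The 2.5 GeV^2 bin width follows the suggestion in the text, while the 0 to 30 GeV^2 range is an illustrative assumption; the B+ and B- Dalitz plots themselves can be drawn from N_plus_hist and N_minus_hist with the same pcolormesh call:

import numpy as np

pdf = two_body_resonances_invariant_mass_GeV_df.selectExpr(
    "Mass_K1_K3 * Mass_K1_K3 as M2_13",
    "Mass_K2_K3 * Mass_K2_K3 as M2_23",
    "H1_Charge + H2_Charge + H3_Charge as B_Charge").toPandas()

# folded Dalitz variables, as in the previous section
m2_low = np.minimum(pdf['M2_13'], pdf['M2_23'])
m2_high = np.maximum(pdf['M2_13'], pdf['M2_23'])

edges = np.arange(0.0, 30.0 + 2.5, 2.5)  # 2.5 GeV^2 x 2.5 GeV^2 bins
is_plus = pdf['B_Charge'] > 0
N_plus_hist, _, _ = np.histogram2d(m2_low[is_plus], m2_high[is_plus], bins=[edges, edges])
N_minus_hist, _, _ = np.histogram2d(m2_low[~is_plus], m2_high[~is_plus], bins=[edges, edges])

# per-bin raw asymmetry; empty bins give nan and are left blank in the plot
with np.errstate(divide='ignore', invalid='ignore'):
    asym = (N_minus_hist - N_plus_hist) / (N_minus_hist + N_plus_hist)

pcolormesh(edges, edges, asym.T, cmap='coolwarm', vmin=-1, vmax=1)
colorbar(label='(N- - N+) / (N- + N+)')
xlabel('M^2 low (GeV^2)')
ylabel('M^2 high (GeV^2)');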
Observing a large asymmetry in some regions of the plot does not necessarily mean you have observed CP violation. If there are very few events in that region of the plot the uncertainty on that large asymmetry may be large. Hence, the value may still be compatible with zero.
You can calculate the statistical uncertainty on the asymmetry, for each bin of the plot, using the same formulas as you used in the global asymmetry section. You can then make a plot showing the uncertainty on the asymmetry.
Dividing the plot showing the asymmetry by the plot showing the statistical uncertainty you can then obtain the significance of the asymmetry in each bin. You can then plot the significance of the asymmetry to see if there is any evidence for CP violation.
In [50]:
# Make a plot showing the uncertainty on the asymmetry
In [51]:
# Make a plot showing the statistical significance of the asymmetry
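A sketch reusing N_plus_hist, N_minus_hist, asym and edges from the asymmetry sketch above, applying the same uncertainty formula as in the global study bin by bin:

import numpy as np

N_tot = N_plus_hist + N_minus_hist
with np.errstate(divide='ignore', invalid='ignore'):
    asym_err = np.sqrt((1 - asym**2) / N_tot)  # per-bin statistical uncertainty
    significance = asym / asym_err             # per-bin significance in sigma

pcolormesh(edges, edges, significance.T, cmap='coolwarm', vmin=-5, vmax=5)
colorbar(label='Asymmetry significance (sigma)')
xlabel('M^2 low (GeV^2)')
ylabel('M^2 high (GeV^2)');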
From your studies of the asymmetry plot, and the plot of its significance, you will be able to identify regions in the Dalitz plots that show indications of sizeable and significant CP Violation. You may find you have several consecutive bins with significant positive, or negative, asymmetries. You may wish to try alternative binnings of the Dalitz plots to best isolate the regions in which the significant asymmetries occur.
You can select events that are in these regions of the Dalitz plots where you observe signs of CP Violation. You can then plot a simple 1D histogram of the invariant mass distribution of the B+ and the B- events, just as you did at the start of the project, but only for events that lie in the region of the Dalitz plot that you are interested in. Make the plots of the B+ and the B- events with the same scale, or superimpose the two plots, so that you can observe if the particle and anti-particle decay processes are occurring at the same rate.
In [52]:
# Make a plot showing the invariant mass of the B+ meson particles
# using events from a region of the Dalitz plot showing sizeable CP asymmetries
In [53]:
# Make a plot showing the invariant mass of the B- meson particles using events from the same region
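One possible sketch, computing the B mass, the B charge and the folded Dalitz variable in a single SQL query; the 10 to 15 GeV^2 window on the lower squared mass is purely an illustrative assumption, to be replaced by the region your significance plot singles out:

region_pdf = sql("""
with kin as (
  select
    H1_Charge + H2_Charge + H3_Charge as B_Charge,
    sqrt({0} + H1_PX*H1_PX + H1_PY*H1_PY + H1_PZ*H1_PZ) as E1,
    sqrt({0} + H2_PX*H2_PX + H2_PY*H2_PY + H2_PZ*H2_PZ) as E2,
    sqrt({0} + H3_PX*H3_PX + H3_PY*H3_PY + H3_PZ*H3_PZ) as E3,
    H1_PX, H1_PY, H1_PZ, H2_PX, H2_PY, H2_PZ, H3_PX, H3_PY, H3_PZ
  from B2HHH_AllData_WithCuts)
select
  sqrt(pow(E1+E2+E3, 2) - pow(H1_PX+H2_PX+H3_PX, 2)
       - pow(H1_PY+H2_PY+H3_PY, 2) - pow(H1_PZ+H2_PZ+H3_PZ, 2)) as B_Mass,
  B_Charge
from kin
where least(
    pow(E1+E3, 2) - pow(H1_PX+H3_PX, 2) - pow(H1_PY+H3_PY, 2) - pow(H1_PZ+H3_PZ, 2),
    pow(E2+E3, 2) - pow(H2_PX+H3_PX, 2) - pow(H2_PY+H3_PY, 2) - pow(H2_PZ+H3_PZ, 2)
  ) / 1e6 between 10 and 15  -- illustrative Dalitz region, in GeV^2
""".format(kcharged_mass*kcharged_mass)).toPandas()

# superimpose B+ and B- so any difference in rate is directly visible
hist(region_pdf.query('B_Charge == 1')['B_Mass'], bins=50, range=(5050, 5500),
     histtype='step', label='B+')
hist(region_pdf.query('B_Charge == -1')['B_Mass'], bins=50, range=(5050, 5500),
     histtype='step', label='B-')
xlabel('Invariant mass (MeV)')
ylabel('Candidates')
legend();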
Congratulations! You should now have successfully observed significant evidence for CP violation. You should have plots that clearly show that particle and anti-particle decay processes occur at different rates in local regions of the Dalitz plot. You may wish to compare your results with those published by the LHCb collaboration in this paper.
Well done! You have successfully completed your first particle physics analysis project. There are many more analyses that can be conducted with the data set that you have been provided and the skills that you have gained. Ideas for some of these are explored in the section below. Maybe you can discover something new!
Now you've finished the analysis please provide feedback to help us improve this project using this brief survey.
The data set you have been provided is the full set of data recorded by LHCb preselected for decays of charged B mesons into three final state tracks. This data set has been used for two important publications, here and here.
We discuss here:
In this analysis you considered the statistical uncertainty on the result. This occurs as a result of having only a limited number of events. In addition there are systematic uncertainties; these arise from biases in your measurement. Here we discuss three sources of these for this analysis.
One source of 'background' events arises from random combinations of tracks in events that happen to fake the 'signal' characteristics. These events will not peak in the mass distribution at the mass of the B meson but rather will have a smoothly varying distribution. Looking at the number and distribution of events away from the mass peak can allow you to estimate the number of background events under the mass peak.
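As a sketch of the simplest version of this idea, one can count candidates in sidebands on either side of the peak and interpolate; all window edges below are illustrative assumptions:

# counts in two 100 MeV sidebands and in a 100 MeV signal window;
# if the background is roughly linear in mass, the average sideband
# count approximates the background hiding under the peak
n_low = b_meson_invariant_mass_withcuts.filter(
    "invariant_mass between 5100 and 5200").count()
n_high = b_meson_invariant_mass_withcuts.filter(
    "invariant_mass between 5360 and 5460").count()
n_peak = b_meson_invariant_mass_withcuts.filter(
    "invariant_mass between 5230 and 5330").count()
background_estimate = (n_low + n_high) / 2.0
print(n_peak, background_estimate)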
The next level of sophistication in the analysis requires fitting the distributions of events that are observed in the B mass distribution in order to estimate the yields of signal and background events. You can see how this is done in the LHCb paper on the analysis. Fitting can be performed using the CERN ROOT framework.
The LHCb papers using this data set that you are using analysed four decay channels of the charged B mesons. You can perform any of these analyses.