First, let's make a list to play with.
In [ ]:
ourList = [0,1,2,3,4,5,6,7,8,9]
Remember, we can use loops with dummy variables to iterate over lists and perform operations on each element. For example, say we want to print ourList
with each element multiplied by 10. We can use a while loop to do this:
In [ ]:
i = 0
while i < 10:
num = ourList[i] *10
print(num)
i = i+1
We can also use conditional statements like if and else to implement more complex logic. What if we wanted to print out all the elements in ourList
that are smaller than 5? We can use conditional statements to do this:
In [ ]:
i = 0
while (i<10):
num = ourList[i]
if num < 5:
print(num)
else:
print("The number is not less than 5")
i = i+1
Finally, recall that we can use the matplotlib module to plot data. Say we wanted to plot each element of ourList
versus the square of each element.
First, we must import the module. Then, we use a magic command that makes the figure appear within the cell.
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
squareList = [0,1,4,9,16,25,36,49,64,81]
plt.plot(ourList,squareList)
# plt.show()
Today, we will learn about NumPy, learn how to define our own functions, and learn about handling data in Python.
As we learned in the previous lecture, Python is a modular language. We can add tools and functionality as we need them, in the same way that we imported matplotlib. Today we'll be learning about a module called numpy
(short for "numerical Python"). Among other things, numpy will give us access to arrays. These are similar to lists but more flexible in the operations we can perform on them.
First, let's create a regular old list to play with.
In [ ]:
c = [0,1,2,3,4,5,6,7,8,9]
What if we wanted to square each element of the list? Give it a try:
In [ ]:
d = c**2
It doesn't work! Regular Python lists don't have this functionality. Let's convert our list to a NumPy array and see what happens.
First, we'll need to import the numpy module. We typically abbreviate it as np
.
In [ ]:
import numpy as np
To convert our list to a numpy array, we'll use np.array()
.
In [ ]:
c = np.array(c)
Now what happens if we try to square each element?
In [ ]:
d = c**2
print(d)
Did it work?
There are a few easier ways to create arrays besides creating a list and turning it into a numpy array. These include:
np.arange(start_,stop_,step_)
np.linspace(first_,last_,num_)
These create arrays of numbers within a range with a fixed step-size between each consecutive number in the array. You can try these out below.
In [ ]:
np.arange(0,11,1)
In [ ]:
np.linspace(0,9,10)
What if we want to fill an array with a different list of numbers? Sometimes it is handy to create an array of all zeros, which can then be replaced later with data. This can be done by using the command np.zeros()
.
Let's say we want to store 10 numbers in an numpy array for easy access in the future. To ready such an array, we call the function np.zeros()
with the size of the array as an input variable.
In [ ]:
data = np.ones(10) * 2#what do i put here?
print(data)
We can also assign new values to elements of existing arrays, using the following "square bracket" notation:
array_name[index_number] = value
This is the same as the list indexing we taught you in Lecture #1.
This command will replace whatever value is currently in the position corresponding to index_number
in the array called array_name
with the value stored in value
.
Recall that arrays are numbered starting from 0, such that
Try it out yourself below
In [ ]:
data = np.zeros(10)
print(data[0])
data[0] = 137#
print(data[0])
Now we move onto slightly more sophisticated ways of accessing elements of an array.
Let's say you wanted the last element of the array, but you don't recall the size of the array. One of the easiest ways to access that element is to use negative indexing.
Negative indexing is the same as normal indexing, but backward, in the sense that you start with the last element of the array and count forward. More explicitly, for any array:
Now then, let's create an array using np.arange()
with 10 elements, and see if you can access the last element and the second to last element using negative indexing. Print out these values.
In [ ]:
#Your code goes here
a = np.arange(0,10,1)
print(a)
print(a[-1])
print(a[-2])
Sometimes it's useful to access more than one element of an array. Let's say that we have an array spanning the range [0,10] (including endpoints), with a step size of 0.1. If you recall, this can be done via the np.linspace()
or np.arange()
functions.
In [ ]:
x = np.arange(0,10,.1)
print(len(x))
In [ ]:
y = np.linspace(0,9.9,100)
In order to get a range of elements rather than simply a single one, we use the notation:
x[start_index:end_index+1]
For example, let's say you want the 1st, 2nd, and 3rd element, then you'd have to do
x[0:3]
We call this kind of array manipulation "array slicing." In this notation, ":" represents you want everything between 0 and 3, and including 0. Let's test this.
In [ ]:
x[0:3]
If you want all the elements beyond a certain point of the array (including that point), then you would leave out the second index, for example:
x[90:]
would give you everything after (and including) the index 90 element. Similarly, if you want all the elements before a certain index, you can leave out the first number:
x[:90]
would give you everything up to the index 90 element.
So, let's say that you would want everything up to and including the tenth element of the array $x$. How would you do that?
(Remember, the tenth element has an index of 9)
In [ ]:
#Your code goes here
x[:10]
To practice with this, try to select just the first half of the array.
In [ ]:
#Your code goes here
x[:50]
Then, pick out middle sixty elements of the array.
In [ ]:
#Your code goes here
x[20:80]
Finally, use only the ":" to get all the elements in the array.
In [ ]:
#Your code goes here
x[:]
So far, we have focused on learning built-in functions (such as from numpy
and matplotlib
), but what about defining our own? This allows you to clean up your code, and apply the same set of operations to multiple variables without having to explicitly write them out every time.
For example, let's say we want to define a function that takes the square root of a number. Let's check to make sure the number is positive first, so that we don't end up with an imaginary answer.
In [ ]:
# What happens if you take square root of a negative number?
n = (-2)**(1/2)
print(n)
In [ ]:
#Defining a square root function
def sqrt(n):
if(n < 0):
print("Negative numbers not allowed")
else:
return n**(1/2)
In [ ]:
sqrt(4)
In [ ]:
sqrt(-4)
So the outline for a function is
def <function name> (<input variable>):
<some code here>
return <output variable>
In general, many common mathematical functions like sqrt, log, exp, sin, cos can be found in the numpy module. So we don't have to write our own - phew!
In [ ]:
import numpy as np
pi = np.pi
print(np.sqrt(25))
print(np.sin(pi/2))
print(np.exp(pi)-pi)
In [ ]:
np.log(10)
In [ ]:
x = np.linspace(0,10,100)
y = np.log(x)
plt.plot(x,y)
If you want more information on this, the documentation of possible functions that can be applied to integers and floats (i.e. single numbers), as well as numpy arrays, can be found here: https://docs.scipy.org/doc/numpy/reference/routines.math.html
When defining your own functions, you can also use multiple input variables. For example, if we want to calculate the length of a vector $(x,y)$, we can create a function that takes in the components $x$ and $y$ individually.
In [ ]:
#Make sure you run this cell!
def length(x, y):
"""Calculates the length of a vector (x,y) using the Pythagorean theorem."""
return np.sqrt(x**2+y**2)
If we call this function on the vector (3,4), we should get 5.
In [ ]:
length(3,4)
In this lecture, we've learned about numpy arrays, loops, and defining functions. You'll have a chance to test these skills in the following exercises!
In [ ]:
def even(array):
#Your code here
Use the lines below to test the function you wrote
In [ ]:
testList = [0,1,2,3,4,5,6,7,8,9,10]
even(testList)
For a given set of points, the pathlength $L$ from $(x_0,y_0)$ to $(x_n,y_n)$ is given by the following expression, \begin{gather} L = \sum_{i = 1}^n \sqrt{ \left(x_i - x_{i-1}\right)^2 + \left(y_i - y_{i-1} \right)^2} \end{gather}
What this quantity represents is the sum of the lengths between $(x_{i-1},y_{i-1})$ and $(x_i,y_i)$ for $i$ between 1 and $n$.
Write a function pathLength
which computes $L$ given two numpy arrays x_array
and y_array
as input variables. You'll need this function later on to work on the challenge problem.
In [ ]:
def pathLength(x_array,y_array):
#Your code goes here
L = 0
i = 1
while (i < len(x_array)):
L = L + length(x_array[i] - x_array[i-1],y_array[i]-y_array[i-1])
i = i+1
return L
Test your function on the example below. Your answer should come out to $4\sqrt{2} \approx 5.657$
In [ ]:
x = np.array([1,2,3,4,5])
y = np.array([1,2,3,4,5])
pathLength(x,y)
So, we have learned a lot about data arrays and how we can manipulate them, either through mathematics or indexing. However, up until this point, all we've done is use arrays that we ourselves created. But what happens if we have data from elsewhere? Can Python use that?
The answer is of course yes, and while there are ways to import data that are a bit complicated at times, we're going to teach you some of the most basic, and most useful, ways.
First, if you haven't already, import the matplotlib module.
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
Now then, let's say we are doing a timing experiment, where we look at the brightness of an object as a function of time. This is actually a very common type of measurement that you may do in research, such as looking for dips in the brightness of stars as a way to detect planets.
This data is stored in a text file named timeseries_data.txt
in the directory lecture2_data
. Let's load it in.
In [2]:
import numpy as np
timeseriesData = np.loadtxt("./lecture2_data/timeseries_data.txt")
Now we have the data loaded into Python as a numpy array, and one handy thing you can do is to use Python to find the dimensions of the array. This is done by using the array.shape
method as so.
In [3]:
timeseriesData.shape
Out[3]:
From the output of this function, we know that timeseriesData
is a 2x1000 array (two rows, 1000 columns).
This is an example of a 2-dimensional array, also known as a matrix.
The first row of timeseriesData
gives the time stamp of when each measurement was taken, while the second row gives the measured value of the brightness at that time.
For ease of handling this data, one can in principle take each of these rows and create new arrays out of them. Let's do just that.
Since timeseriesData
is 2-dimensional, each element has two indices.
In [4]:
t = timeseriesData[0,:] # this represents the time
signal = timeseriesData[1,:] # this represents the brightness
By convention, we first specify the row index followed by the column index.
array_name[n,:]
is the n-th row, and all columns within that row.array_name[:,n]
is the n-th column, and all rows within that particular column.Now then, let's see what the data looks like using the plot()
function with t
as your x-axis and signal
as your y-axis.
In [10]:
%matplotlib nbagg
import matplotlib.pyplot as plt
#Your code goes here
plt.close()
plt.plot(t,signal)
Out[10]:
Looking at our data, you see clear spikes that jump well above most of the signal. (I've added this to the data to represent outliers that may sometimes appear when you're messing with raw data, and those must be dealt with). In astronomy, you sometimes have relativistic charged particles, not from your source, that hit the detector known as cosmic rays, and we often have to remove these.
There are some very complex codes that handle cosmic rays, but for our purposes (keeping it easy), we're going to just set a hard cut off of, let's say 15.
In order to do this, we can use conditional indexing in place of normal indices. This involves taking a conditional statement (more on those later) and testing whether it evaluates to True on each element in the array.
This gives an array of True/False, which we can use as logical indices to select only the entries for which the logical statement is True.
In [6]:
cutOff = 15.
signalFix = signal[signal < cutOff]
In this case, the conditional statement that we have used is signal < cutOff
.
Here, conditional indexing keeps the data that we have deemed "good" by this criteria. We can also do the same for the corresponding time stamps, since t
and signal
have the same length.
In [7]:
tFix = t[signal < cutOff]
Now let's plot it. You try.
In [11]:
#Your code goes here
plt.close()
plt.plot(tFix,signalFix)
Out[11]:
Now that you have your data all cleaned up, it would be nice if we could save it for later and not have to go through the process of cleaning it up every time. Fear not! Python has you covered.
There are two formats that we are going to cover, one that is Python-specific, and the other a simple text format.
First, we must package our two cleaned up arrays into one again. This can be done simply with the np.array()
function.
In [12]:
dataFix = np.array([tFix,signalFix])
Then, we can use either the np.save()
function or the np.savetxt()
function, the first saving the array into a '.npy' file and the other, into a '.txt' file. The syntax is pretty much the same for each.
In [13]:
np.save('./lecture2_data/dataFix.npy',dataFix)
np.savetxt('./lecture2_data/dataFix.txt',dataFix)
Now that your data files are saved, you can load them up again, using np.loadtxt()
and np.load()
for .txt and .npy files respectively. We used np.loadtxt()
above, and np.load works the same way. So, let's load in the .npy file and see if our data was saved correctly.
In [ ]:
data = np.load('./lecture2_data/dataFix.npy')
t = data[0,:]
signal = data[1,:]
plt.close()
plt.plot(t,signal)
plt.show()
Now, let's see if you can do the same thing, but with the .txt file that we saved.
In [ ]:
#Your code goes here
Let's combine what we learned about loops to make our data workflow more efficient. Suppose we have a set of data saved in separate text files that we would like to load automatically. For example, in ./lecture2_data/
you will find files c1.dat
, c2.dat
, c3.dat
, c4.dat
, c5.dat
, c6.dat
.
Rather than loading each of these files individually, you can use a for (or while) loop, constructing a string at each iteration corresponding to each of these files.
In Python you can use +
to concatenate strings together. Here's an example.
In [20]:
'5'*5
Out[20]:
In [21]:
first_string = 'a'
second_string = 'b'
print(first_string + second_string)
You can also cast an integer to a string using the str
command.
In [22]:
first_string = 'a'
second_string = str(1)
print(first_string + second_string)
In [27]:
datalist = []
This is an odd idea, defining an list variable without any elements, so instead think of it as a basket without anything inside of it yet. We will use the append()
function to fill it.
Next, we call np.loadtxt
on a single .dat
file and add it to datalist
using the command
datalist.append(loadedFile)
where loadedFile
is the variable we've assigned the file to after loading it in.
In [28]:
loadedFile = np.loadtxt('./lecture2_data/c1.dat')
datalist.append(loadedFile)
Now it's your turn. Can you figure out how to load the rest of the data files into datalist
automatically?
Hint: The names of the remaining files are are c2.dat
, c3.dat
, c4.dat
, c5.dat
, and c6.dat
. What is the only thing that changes among these names? Can you think of a way to generate this part separately and combine it with the rest of the string?
In [ ]:
# Your code here
So, to summarize, not only can you manipulate arrays, but now you can save them and load them. In a way, those are some of the most important skills in scientific computing. Almost everything you'll be doing requires you know this, and now that you've mastered it, you're well on your way to being an expert in computational physics and astronomy!