Review from the previous lecture

In the previous lecture we covered basic mathematical operations, variables, and lists. We also introduced you to conditional statements, loops, and basic plotting using matplotlib. Before we move forward, we'll do a quick review.

First, let's make a list to play with.



In [ ]:

    
ourList = [0,1,2,3,4,5,6,7,8,9]

Remember, we can use loops with dummy variables to iterate over lists and perform operations on each element. For example, say we want to print ourList with each element multiplied by 10. We can use a while loop to do this:



In [ ]:

    
i = 0
while i < 10:
    num = ourList[i] *10
    print(num)
    i = i+1

We can also use conditional statements like if and else to implement more complex logic. What if we wanted to print out all the elements in ourList that are smaller than 5? We can use conditional statements to do this:



In [ ]:

    
i = 0
while (i<10):
    num = ourList[i]
    if num < 5:
        print(num)
    else:
        print("The number is not less than 5")
    i = i+1

Finally, recall that we can use the matplotlib module to plot data. Say we wanted to plot each element of ourList versus the square of each element.

First, we must import the module. Then, we use a magic command that makes the figure appear within the cell.



In [ ]:

    
import matplotlib.pyplot as plt
%matplotlib inline

squareList = [0,1,4,9,16,25,36,49,64,81]

plt.plot(ourList,squareList)
# plt.show()

Lecture 2 - NumPy, Functions, and Data

Today, we will learn about NumPy, learn how to define our own functions, and learn about handling data in Python.

An introduction to numpy and numpy arrays and a discussion of their usefulness in solving real problems
Functions and how to define new ones.
Reading in data from text and numpy file formats, along with creating your own outputs to be used later

A. Introduction to Numpy Arrays - Initialization and Advanced Indexing

As we learned in the previous lecture, Python is a modular language. We can add tools and functionality as we need them, in the same way that we imported matplotlib. Today we'll be learning about a module called numpy (short for "numerical Python"). Among other things, numpy will give us access to arrays. These are similar to lists but more flexible in the operations we can perform on them.

First, let's create a regular old list to play with.



In [ ]:

    
c = [0,1,2,3,4,5,6,7,8,9]

What if we wanted to square each element of the list? Give it a try:



In [ ]:

    
d = c**2

It doesn't work! Regular Python lists don't have this functionality. Let's convert our list to a NumPy array and see what happens.

First, we'll need to import the numpy module. We typically abbreviate it as np.



In [ ]:

    
import numpy as np

To convert our list to a numpy array, we'll use np.array().



In [ ]:

    
c = np.array(c)

Now what happens if we try to square each element?



In [ ]:

    
d = c**2
print(d)

Did it work?

There are a few easier ways to create arrays besides creating a list and turning it into a numpy array. These include:

np.arange(start_,stop_,step_)
np.linspace(first_,last_,num_)

These create arrays of numbers within a range with a fixed step-size between each consecutive number in the array. You can try these out below.



In [ ]:

    
np.arange(0,11,1)



In [ ]:

    
np.linspace(0,9,10)

What if we want to fill an array with a different list of numbers? Sometimes it is handy to create an array of all zeros, which can then be replaced later with data. This can be done by using the command np.zeros().

Let's say we want to store 10 numbers in an numpy array for easy access in the future. To ready such an array, we call the function np.zeros() with the size of the array as an input variable.



In [ ]:

    
data = np.ones(10) * 2#what do i put here?
print(data)

We can also assign new values to elements of existing arrays, using the following "square bracket" notation:

array_name[index_number] = value

This is the same as the list indexing we taught you in Lecture #1.

This command will replace whatever value is currently in the position corresponding to index_number in the array called array_name with the value stored in value.

Recall that arrays are numbered starting from 0, such that

Index of first position = 0
Index of second position = 1
Index of third position = 2
etc.

Try it out yourself below



In [ ]:

    
data = np.zeros(10)
print(data[0])
data[0] = 137#
print(data[0])

Now we move onto slightly more sophisticated ways of accessing elements of an array.

Let's say you wanted the last element of the array, but you don't recall the size of the array. One of the easiest ways to access that element is to use negative indexing.

Negative indexing is the same as normal indexing, but backward, in the sense that you start with the last element of the array and count forward. More explicitly, for any array:

array[-1] = last element of array
array[-2] = second to last element of the array
array[-3] = third to last element of the array
etc

Now then, let's create an array using np.arange() with 10 elements, and see if you can access the last element and the second to last element using negative indexing. Print out these values.



In [ ]:

    
#Your code goes here
a = np.arange(0,10,1)
print(a)
print(a[-1])
print(a[-2])

Sometimes it's useful to access more than one element of an array. Let's say that we have an array spanning the range [0,10] (including endpoints), with a step size of 0.1. If you recall, this can be done via the np.linspace() or np.arange() functions.



In [ ]:

    
x = np.arange(0,10,.1)
print(len(x))



In [ ]:

    
y = np.linspace(0,9.9,100)

In order to get a range of elements rather than simply a single one, we use the notation:

x[start_index:end_index+1]

For example, let's say you want the 1st, 2nd, and 3rd element, then you'd have to do

x[0:3]

We call this kind of array manipulation "array slicing." In this notation, ":" represents you want everything between 0 and 3, and including 0. Let's test this.



In [ ]:

    
x[0:3]

If you want all the elements beyond a certain point of the array (including that point), then you would leave out the second index, for example:

x[90:]

would give you everything after (and including) the index 90 element. Similarly, if you want all the elements before a certain index, you can leave out the first number:

x[:90]

would give you everything up to the index 90 element.

So, let's say that you would want everything up to and including the tenth element of the array $x$. How would you do that?

(Remember, the tenth element has an index of 9)



In [ ]:

    
#Your code goes here
x[:10]

To practice with this, try to select just the first half of the array.



In [ ]:

    
#Your code goes here
x[:50]

Then, pick out middle sixty elements of the array.



In [ ]:

    
#Your code goes here
x[20:80]

Finally, use only the ":" to get all the elements in the array.



In [ ]:

    
#Your code goes here
x[:]

B. Defining Your Own Functions

So far, we have focused on learning built-in functions (such as from numpy and matplotlib), but what about defining our own? This allows you to clean up your code, and apply the same set of operations to multiple variables without having to explicitly write them out every time.

For example, let's say we want to define a function that takes the square root of a number. Let's check to make sure the number is positive first, so that we don't end up with an imaginary answer.



In [ ]:

    
# What happens if you take square root of a negative number?
n = (-2)**(1/2) 
print(n)



In [ ]:

    
#Defining a square root function
def sqrt(n):
    if(n < 0):
        print("Negative numbers not allowed")
    else:
        return n**(1/2)



In [ ]:

    
sqrt(4)



In [ ]:

    
sqrt(-4)

So the outline for a function is

def <function name> (<input variable>):
    <some code here>
    return <output variable>

In general, many common mathematical functions like sqrt, log, exp, sin, cos can be found in the numpy module. So we don't have to write our own - phew!



In [ ]:

    
import numpy as np

pi = np.pi

print(np.sqrt(25))
print(np.sin(pi/2))
print(np.exp(pi)-pi)



In [ ]:

    
np.log(10)



In [ ]:

    
x = np.linspace(0,10,100)
y = np.log(x)

plt.plot(x,y)

If you want more information on this, the documentation of possible functions that can be applied to integers and floats (i.e. single numbers), as well as numpy arrays, can be found here: https://docs.scipy.org/doc/numpy/reference/routines.math.html

When defining your own functions, you can also use multiple input variables. For example, if we want to calculate the length of a vector $(x,y)$, we can create a function that takes in the components $x$ and $y$ individually.



In [ ]:

    
#Make sure you run this cell! 
def length(x, y):
    """Calculates the length of a vector (x,y) using the Pythagorean theorem."""
    return np.sqrt(x**2+y**2)

If we call this function on the vector (3,4), we should get 5.



In [ ]:

    
length(3,4)

In this lecture, we've learned about numpy arrays, loops, and defining functions. You'll have a chance to test these skills in the following exercises!

Exercise 1: Define a function that prints every even-indexed element of an array.



In [ ]:

    
def even(array):
    #Your code here

Use the lines below to test the function you wrote



In [ ]:

    
testList = [0,1,2,3,4,5,6,7,8,9,10]
even(testList)

Exercise 2 - Calculate path length.

For a given set of points, the pathlength $L$ from $(x_0,y_0)$ to $(x_n,y_n)$ is given by the following expression, \begin{gather} L = \sum_{i = 1}^n \sqrt{ \left(x_i - x_{i-1}\right)^2 + \left(y_i - y_{i-1} \right)^2} \end{gather}

What this quantity represents is the sum of the lengths between $(x_{i-1},y_{i-1})$ and $(x_i,y_i)$ for $i$ between 1 and $n$.

Write a function pathLength which computes $L$ given two numpy arrays x_array and y_array as input variables. You'll need this function later on to work on the challenge problem.



In [ ]:

    
def pathLength(x_array,y_array):
    #Your code goes here
    L = 0
    i = 1
    while (i < len(x_array)):
        L = L + length(x_array[i] - x_array[i-1],y_array[i]-y_array[i-1])
        i = i+1
    return L

Test your function on the example below. Your answer should come out to $4\sqrt{2} \approx 5.657$



In [ ]:

    
x = np.array([1,2,3,4,5])
y = np.array([1,2,3,4,5])
pathLength(x,y)

C. Loading And Saving Data Arrays

So, we have learned a lot about data arrays and how we can manipulate them, either through mathematics or indexing. However, up until this point, all we've done is use arrays that we ourselves created. But what happens if we have data from elsewhere? Can Python use that?

The answer is of course yes, and while there are ways to import data that are a bit complicated at times, we're going to teach you some of the most basic, and most useful, ways.

First, if you haven't already, import the matplotlib module.



In [ ]:

    
import matplotlib.pyplot as plt
%matplotlib inline

Now then, let's say we are doing a timing experiment, where we look at the brightness of an object as a function of time. This is actually a very common type of measurement that you may do in research, such as looking for dips in the brightness of stars as a way to detect planets.

This data is stored in a text file named timeseries_data.txt in the directory lecture2_data. Let's load it in.



In [2]:

    
import numpy as np
timeseriesData = np.loadtxt("./lecture2_data/timeseries_data.txt")

Now we have the data loaded into Python as a numpy array, and one handy thing you can do is to use Python to find the dimensions of the array. This is done by using the array.shape method as so.



In [3]:

    
timeseriesData.shape









    Out[3]:





(2, 1000)

From the output of this function, we know that timeseriesData is a 2x1000 array (two rows, 1000 columns).

This is an example of a 2-dimensional array, also known as a matrix.

The first row of timeseriesData gives the time stamp of when each measurement was taken, while the second row gives the measured value of the brightness at that time.

For ease of handling this data, one can in principle take each of these rows and create new arrays out of them. Let's do just that.

Since timeseriesData is 2-dimensional, each element has two indices.



In [4]:

    
t = timeseriesData[0,:] # this represents the time
signal = timeseriesData[1,:] # this represents the brightness

By convention, we first specify the row index followed by the column index.

array_name[n,:] is the n-th row, and all columns within that row.
array_name[:,n] is the n-th column, and all rows within that particular column.

Now then, let's see what the data looks like using the plot() function with t as your x-axis and signal as your y-axis.



In [10]:

    
%matplotlib nbagg
import matplotlib.pyplot as plt

#Your code goes here
plt.close()
plt.plot(t,signal)









    














    











    Out[10]:





[<matplotlib.lines.Line2D at 0x1185e3438>]

Looking at our data, you see clear spikes that jump well above most of the signal. (I've added this to the data to represent outliers that may sometimes appear when you're messing with raw data, and those must be dealt with). In astronomy, you sometimes have relativistic charged particles, not from your source, that hit the detector known as cosmic rays, and we often have to remove these.

There are some very complex codes that handle cosmic rays, but for our purposes (keeping it easy), we're going to just set a hard cut off of, let's say 15.

In order to do this, we can use conditional indexing in place of normal indices. This involves taking a conditional statement (more on those later) and testing whether it evaluates to True on each element in the array.

This gives an array of True/False, which we can use as logical indices to select only the entries for which the logical statement is True.



In [6]:

    
cutOff = 15.
signalFix = signal[signal < cutOff]

In this case, the conditional statement that we have used is signal < cutOff.

Here, conditional indexing keeps the data that we have deemed "good" by this criteria. We can also do the same for the corresponding time stamps, since t and signal have the same length.



In [7]:

    
tFix = t[signal < cutOff]

Now let's plot it. You try.



In [11]:

    
#Your code goes here
plt.close()
plt.plot(tFix,signalFix)









    














    











    Out[11]:





[<matplotlib.lines.Line2D at 0x11863fa58>]

Now that you have your data all cleaned up, it would be nice if we could save it for later and not have to go through the process of cleaning it up every time. Fear not! Python has you covered.

There are two formats that we are going to cover, one that is Python-specific, and the other a simple text format.

First, we must package our two cleaned up arrays into one again. This can be done simply with the np.array() function.



In [12]:

    
dataFix = np.array([tFix,signalFix])

Then, we can use either the np.save() function or the np.savetxt() function, the first saving the array into a '.npy' file and the other, into a '.txt' file. The syntax is pretty much the same for each.



In [13]:

    
np.save('./lecture2_data/dataFix.npy',dataFix)
np.savetxt('./lecture2_data/dataFix.txt',dataFix)

Now that your data files are saved, you can load them up again, using np.loadtxt() and np.load() for .txt and .npy files respectively. We used np.loadtxt() above, and np.load works the same way. So, let's load in the .npy file and see if our data was saved correctly.



In [ ]:

    
data = np.load('./lecture2_data/dataFix.npy')
t = data[0,:]
signal = data[1,:]
plt.close()
plt.plot(t,signal)
plt.show()

Now, let's see if you can do the same thing, but with the .txt file that we saved.



In [ ]:

    
#Your code goes here

D. Loading data files automatically

Let's combine what we learned about loops to make our data workflow more efficient. Suppose we have a set of data saved in separate text files that we would like to load automatically. For example, in ./lecture2_data/ you will find files c1.dat, c2.dat, c3.dat, c4.dat, c5.dat, c6.dat.

Rather than loading each of these files individually, you can use a for (or while) loop, constructing a string at each iteration corresponding to each of these files.

In Python you can use + to concatenate strings together. Here's an example.



In [20]:

    
'5'*5









    Out[20]:





'55555'



In [21]:

    
first_string = 'a'
second_string = 'b'
print(first_string + second_string)

ab

You can also cast an integer to a string using the str command.



In [22]:

    
first_string = 'a'
second_string = str(1)
print(first_string + second_string)

a1

Now you try

Your goal in this task is to write some code to load in this set of 6 .dat files as numpy arrays.

First, we will define an empty list (call it datalist) that will store the data.



In [27]:

    
datalist = []

This is an odd idea, defining an list variable without any elements, so instead think of it as a basket without anything inside of it yet. We will use the append() function to fill it.

Next, we call np.loadtxt on a single .dat file and add it to datalist using the command

datalist.append(loadedFile)

where loadedFile is the variable we've assigned the file to after loading it in.



In [28]:

    
loadedFile = np.loadtxt('./lecture2_data/c1.dat')
datalist.append(loadedFile)

Now it's your turn. Can you figure out how to load the rest of the data files into datalist automatically?

Hint: The names of the remaining files are are c2.dat, c3.dat, c4.dat, c5.dat, and c6.dat. What is the only thing that changes among these names? Can you think of a way to generate this part separately and combine it with the rest of the string?



In [ ]:

    
# Your code here

So, to summarize, not only can you manipulate arrays, but now you can save them and load them. In a way, those are some of the most important skills in scientific computing. Almost everything you'll be doing requires you know this, and now that you've mastered it, you're well on your way to being an expert in computational physics and astronomy!