Carnegie Python Bootcamp

Welcome to the python bootcamp. This thing you're reading is called an ipython notebook and will be your first introduction to the Python programming language. Notebooks are a combination of text markup and code that you can run in real time.

Importing what you need

It is often said that python comes with the batteries included, which means it comes with almost everything you need, bundled up in separate modules. But not everything is loaded into memory automatically. You need to import the modules you need. Try running the following to import the antigravity module. Just hit CTRL-ENTER in the cell to execute the code.


In [ ]:
import antigravity

You will find that python is full of nerdy humour like this. By the way, python is named after Monty Python's Flying Circus, not the reptile. Let's go ahead and import some useful modules you will need. Use this next command cell to import the modules named os, sys, and numpy. You can import them one-by-one or all at once, separated by commas.


In [ ]:
# Use this command box to run your own commands.
# By the way, Python ignores anything after a hash (#) symbol, so good for comments

The os module gives you functions relating to the operating system, sys gives you information on the python interpreter and your script, and numpy is the mathematical powerhouse of python, allowing you to manipulate arrays of numbers.

Once you import a module, all the functions and data for that module live in its namespace. So if I wanted to use the getcwd function from the os module, I would have to refer to it as os.getcwd. Try running getcwd all by itself; you'll get an error. Then do it the correct way:


In [ ]:

As you can probably guess, os.getcwd gets the current working directory. Try some commands for yourself. The numpy module has many mathematical functions. Try computing the square root (sqrt) of 2. You can also try computing $\sin(\pi)$ (numpy has a built-in value numpy.pi) and even $e^{i\pi}$ (python uses the engineering convention of 1j as $\sqrt{-1}$).


In [ ]:

There can also be namespaces inside namespaces. For example, to test if a file exists, you would want to use the isfile() function within the path module, which is itself in the os module. Give it a try:


In [ ]:

Here is a non-comprehensive list of modules you may find useful. Documentation for all of them can be found with a quick google search.

  • os, sys: As mentioned above, these give you access to the operating system, files, and environment.
  • numpy: Gives you arrays (vectors, matrices) and the ability to do math on them.
  • scipy: Think of this as "Numerical Recipes for Python". Root-finding, fitting functions, integration, special mathematical functions, etc.
  • pandas: Primarily used for reading/writing data tables. Useful for data wrangling.
  • astropy: Astronomy-related functions, including FITS image reading, astrometry, and cosmological calculations.

Getting Help

Almost everything in python has documentation. Just use the help() function on anything in python to get information. Try running help() on a function you used previously. Try as many others as you like.


In [ ]:

Variables and Types

A variable is like a box that you put stuff (values) in for later use. In Python, there are lots of different types of variables corresponding to what you put in. Unlike other languages, you don't have to tell python what type each variable is: python figures it out on its own. To put the value 1 into a box called x, just use the equals sign, like you would when solving a math problem.


In [ ]:
x=1

Now you can use this value later. Try printing out x using the print() function. You can also modify the variable. Try adding to or subtracting from x and print out its value again.


In [ ]:
print(x)
x=x+1
print(x)

But be careful about the type. Try assigning the value of 1 to y. Now divide it by two and print out the result.


In [ ]:
y = 1
y = y/2
print(y)

Here is where we get to the first major difference between python 2 and python 3. In python 2, an integer divided by another integer is kept as an integer (this is similar to the behavior of many other programming languages), so 1 divided by 2 is 0. In python 3.x, division always produces a real number (called a float), so 1 divided by 2 is 0.5. If you want integer division in python 3, use // instead. This kind of division also works in python 2, so it's worth getting used to it.
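
As a quick illustration of the two operators (a small sketch, assuming you are running python 3; the variable names are just for this example):

```python
# Assumes Python 3 semantics for the / operator.
y = 1
true_div = y / 2         # true division always gives a float
floor_div = y // 2       # floor division of two integers gives an integer
float_floor = 1.0 // 2   # floor division involving a float gives a float
neg_floor = -7 // 2      # floors toward negative infinity, not toward zero

print(true_div)      # 0.5
print(floor_div)     # 0
print(float_floor)   # 0.0
print(neg_floor)     # -4
```

Note the last case: // rounds down, so negative results may not be what you expect if you are used to truncation toward zero.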

Repeat what you did before, but this time start by assigning the value of 1.0 to y first. Also try using integer division on a float variable.


In [ ]:

Remember that variables are simply containers (or labels if you prefer). They don't have a fixed type. Try using the type() function on y.

There are other types of variables. The most commonly used are strings, lists, and arrays. But literally anything can be assigned to a variable. You can even assign the numpy module to the variable np if you don't like typing numpy all the time. Try it out.


In [ ]:

Now you can use np.sqrt() rather than numpy.sqrt(). Most python programmers do this with commonly-used modules. In fact, we usually just use the special form: import numpy as np.


In [ ]:

Strings

Strings are collections of alpha-numeric characters and are the primary way data are represented outside your code. So you will find yourself manipulating strings in Python ALL THE TIME. You might need to convert strings from a text file into python data objects to work with or you might need to do the opposite and generate an output file with final results that can be read by humans. Below are the most common things we need.

Strings are enclosed in matching single (') or double (") quotes, or even triple (''') quotes. Python doesn't distinguish as long as you match them consistently. Triple-quoted strings can span many lines and are useful for literal text or code documentation.


In [ ]:
# Strings are enclosed by single or double quotes
s='this is a string'
print (s)

We can use the len() function to determine the length of the string. Try this out below.


In [ ]:

Strings are very similar to lists in python (we'll see that below). Each alpha-numeric character is an element in the string, and you can refer to individual elements (characters) using an index enclosed in square brackets ([]). You can also specify a range of indices by separating them with a colon (:). Python indexes from 0 (not 1), so the first element is index [0], the second [1] and so on. Negative indices count from the end of the string. Try printing out the 2nd character of your string s, then the whole string except for the first and last characters.
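
To make the indexing rules concrete, here is a small sketch on a different string (the variable word is just for illustration, so you can still try the exercise on s yourself):

```python
word = 'bootcamp'

print(len(word))    # 8 characters, with indices 0 through 7
print(word[0])      # first character: b
print(word[-1])     # last character: p
print(word[0:4])    # indices 0,1,2,3 (the stop index is not included): boot
print(word[4:])     # leaving out the stop index means "to the end": camp
```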


In [ ]:

Specifying a range of indices (as well as more complicated indexing we'll see later) is called slicing. There is also a string module that contains many functions for manipulating strings.

Formatting strings

Sometimes you'll need your integers and floats to be converted into strings and written out into a table with specific formats (e.g., number of significant figures). This involves a syntax that's almost a separate language itself (though if you've used C or C++ it will be very familiar). Here is a good reference: https://pyformat.info/

We'll cover the most important. First, if you just print out a regular floating point number, you get some arbitrary number of significant figures. The same is true if you just try to convert the float to a string using str(), which takes any type of variable and tries to turn it into a string. Try printing the string value of np.pi.


In [ ]:

If you only want two digits after the decimal point, or you want the whole number to span 15 spaces (to make nicely lined-up columns for a table), you need to use a format string. A format string is just like a regular string, but has special place-holders for your numbers. These are enclosed in curly brackets ({}) and have special codes to specify how to format the variable. Without any other information, a simple {} will be replaced with whatever str() produces for the variable. For more control over numerical values, specify :[width].[prec]f for floats and :[width]d for integers. Replace [width] and [prec] with the total width you want your number to occupy and the number of digits after the decimal, respectively. Here's an example:


In [ ]:
fmt = "This is a float with two decimals:  {:.2f}"
print (fmt)

By itself, the format string didn't do much. In order to inject the number, we use its .format() function.


In [ ]:
# two decimal places
print(fmt.format(x))

You may want to make a format string with more than one number or string. You can do this by specifying multiple format codes and inject an equal number of arguments to the .format() function.


In [ ]:
fmt = "Here is a float: '{:.2f}', and another '{:8.4f}', an integer {:d}, and a string {}"
print (fmt.format(x, np.pi, 1000, 'look ma, no quotes!'))

We decided to show the new style of string formatting, as it is the python 3 way of the future and is more powerful than the old style. Both styles are supported in both versions of python and the reference above has plenty of examples of both.
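
The width part of the format code is what makes table columns line up. Here is a minimal sketch with made-up data (the rows list and the column widths are arbitrary choices for illustration). The :<8s code left-justifies a string in 8 spaces, :5d right-justifies an integer in 5 spaces, and :10.3f gives a float 10 spaces with 3 decimals:

```python
# Made-up rows of (name, count, value) purely for illustration
rows = [("alpha", 1, 3.14159), ("beta", 20, 2.71828), ("gamma", 300, 1.41421)]

fmt = "{:<8s}{:5d}{:10.3f}"
lines = []
for name, count, value in rows:
    line = fmt.format(name, count, value)
    lines.append(line)
    print(line)
```

Every line comes out exactly 8+5+10 = 23 characters wide, so the columns line up no matter how many digits each number has (as long as it fits in the width).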

Lists and numpy arrays

Lists contain a sequence of objects, typically a sequence of numbers. They are enclosed by square brackets (e.g., []), with each item separated by a comma. Here are some examples.


In [ ]:
# A list of floats
x1=[1.,2.,7.,2500.]
print(x1)

In [ ]:
# try making a list of strings. Use indexing to print out single elements and slices.

Lists can also contain a mixture of different types (even lists). Basically, anything can be an element of a list. Try making a list of strings, floats, and the list x1 above. Print it out to see what it looks like. Can you guess how to refer to an element of a list that's in another list?


In [ ]:

Numpy arrays allow for more functionality than lists. While they may also contain a mix of object types, you will primarily be working with numpy arrays that are comprised of numbers: either integers or floats. For example, you will at some point read in a table of data into a numpy array and do things to it, like add, multiply, etc.

Above, we imported the numpy module as np. We will use this to create arrays.


In [ ]:
x=np.array([1.,2.,3.,4.])
print(x)

Try adding an integer to x and print it out. Then try other mathematical functions from the numpy module on it.


In [ ]:

Here is where the real power of numpy arrays comes into play. We can use numpy to carry out all kinds of mathematical tasks that in other programming languages (like C, FORTRAN, etc) would require some kind of loop. Here are some of the most common tasks we'll use. By using numpy functions on arrays of numbers, we speed up the code a lot. This is commonly referred to as vectorizing your code.
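
One thing that trips people up is that the same operators behave differently on lists and arrays. A small sketch (the variable names here are just for illustration):

```python
import numpy as np

values = [1., 2., 3., 4.]
arr = np.array(values)

list_times_two = values * 2    # repeats the list: 8 elements
arr_times_two = arr * 2        # doubles each element: still 4 elements

# To do element-wise math on a plain list you need an explicit loop
doubled_list = [2 * v for v in values]

# numpy functions also act on every element at once
roots = np.sqrt(arr)

print(list_times_two)
print(arr_times_two)
print(doubled_list)
print(roots)
```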

Array creation

There are many functions in numpy that allow you to make arrays from scratch.

We can create an array of zeros:


In [ ]:
x1=np.zeros(5)
print(x1)

Take a guess at how to create a 5-element array of ones.


In [ ]:

Now suppose you want all the elements to be equal to np.pi. How could you do that as a one-liner? Hint: vectorize.


In [ ]:

We can create a sequence of numbers using np.arange(start,stop,step), where you specify a number to start (inclusive), when to stop (non-inclusive), and what step size to have between each element. Make an array called x1 using arange that goes from 0 to 4 inclusive.


In [ ]:

Now make an array called x2 that goes from 0 to 10 in steps of 2.


In [ ]:

Another handy function is np.linspace(start,stop,N), which gives you a specified number N of elements equally spaced between start and stop. The stop value in this case is inclusive (will be part of the sequence). Try making an array called x3 that goes from 0 to 8 and has 5 elements.
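
To see the difference between the two functions side by side, here is a short sketch with different numbers from the exercises (the names a and b are arbitrary):

```python
import numpy as np

# arange: you choose the step; the stop value (10) is NOT included
a = np.arange(1, 10, 3)
print(a)    # [1 4 7]

# linspace: you choose the number of points; the stop value (1.0) IS included
b = np.linspace(0., 1., 3)
print(b)    # [0.  0.5 1. ]
```

As a rule of thumb, arange is convenient when you care about the step size and linspace when you care about the number of points and the end values.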


In [ ]:

One can also create N-dimensional numpy arrays. For example, images that you read into Python will typically be stored into 2D numpy arrays.


In [ ]:
x=np.ones((4,2))
print(x)
print(x.shape)

Array Math

Just as with spreadsheets, you can do math on entire arrays using operators (+,-,*, /, and **) and numpy functions.

Try doing some math on your arrays x1, x2, and x3.


In [ ]:
x4=0.5+x1+x2*x3/2.
print(x4)

What happens if you try to add x1 to the matrix x you created above? Give it a try:


In [ ]:

We also have access to most mathematical functions you'll need. Try raising an array to 3rd power using np.power. Take the base-10 log of an array and make sure it gives you what you expect. A shorthand for raising to a power is **, for example, 2**3=8.


In [ ]:

Matrix Math

numpy can treat 2D arrays as matrices and leverage special compiled libraries so that you can do linear algebra routines easily (and very fast). Here, we construct a rotation matrix and use dot to do the dot-product. Try changing theta to different angles (in radians!).


In [ ]:
# CCW rotation by 45 degrees
theta=np.pi/4
# Rotation matrix
x=np.array([[np.cos(theta),-np.sin(theta)],
            [np.sin(theta),np.cos(theta)]])
print(x)

# Let's rotate (1,1) about the origin
y=np.array([1.,1.])
z=np.dot(x,y)

print(y)
print(z)

The other common matrix tasks can also be done with either numpy functions, numpy.linalg, or member functions of the object itself


In [ ]:
# Taking the transpose
x_t = x.T
print(x_t)

# Computing the inverse
x_i = np.linalg.inv(x)

# Matrix Multiplication
I = np.dot(x_i,x)
print (I)

Array Slicing

Often, you need to access elements or sub-arrays within your array. This is referred to as slicing. We can select individual elements in an array using indices just as we did for strings (note that 0 is the first element and negative indices count backwards from the end). The most general slicing looks like [start:stop:step]. Below, we create an array. Try to print out the following using slices:

  • the first element
  • the last element (there's two ways to do this)
  • a sub-array from 3rd element to the end
  • a sub-array with the last element stripped
  • a sub-array with a single element (the last)
  • a sub-array with every second element
  • a sub-array with all elements in reverse order
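
If the [start:stop:step] notation is new to you, here is a sketch of how the pieces work, using a plain list of letters rather than the array in the exercise (slicing works the same way on lists, strings, and 1D arrays):

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

print(letters[1:4])     # start at index 1, stop before index 4: ['b', 'c', 'd']
print(letters[:3])      # omitted start means "from the beginning": ['a', 'b', 'c']
print(letters[3:])      # omitted stop means "to the end": ['d', 'e', 'f']
print(letters[1:5:2])   # a step of 2 skips every other element: ['b', 'd']
print(letters[::-1])    # a negative step walks backwards: ['f', 'e', 'd', 'c', 'b', 'a']
```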

In [ ]:
x=np.arange(5)
print(x)

We can also "slice" N-dimensional arrays. Note that we have used reshape to transform a 1D array into a 2D array with the same total number of elements. This is another handy way to create N-dimensional arrays.


In [ ]:
x=np.arange(8)
print(x)
x=x.reshape((4,2))
print(x)

print(x[0,:])
print(x[:,0])

The reverse of reshape is ravel, which flattens a multi-dimensional array into a 1D array. Try this on x.


In [ ]:

Control blocks

So far, we've been running individual sets of commands and looking at their results immediately. Later, we will write a complete program, which is really just bundling up instructions into a recipe for solving a given task. But as the tasks we want to perform become more complicated, we need control blocks. These allow us to:

  • Repeat tasks again and again (loops)
  • Perform tasks only if certain conditions are met (if-else blocks)
  • Group instructions into a single logical task (user-defined functions)

python is rather unique in that it uses indenting to indicate the beginning and end of a logical block. This actually forces you to write readable code, which is a really good thing!

For loops

for loops are useful for repeating a series of operations a given number of times. In python, you loop over elements of a list or array. So if you want to loop over a sequence of integers (say, the indices of an array), then you would use the range() function to generate the list of integers. You might also use the len() function if you need the length of the array.


In [ ]:
# range(n) generates the n integers 0, 1, ..., n-1
# (in python 3 you need list() to see them all at once)
print(list(range(5)))

# We can use it to iterate over a for loop
for ii in range(5):
    print(ii)

Notice that after the line containing the for statement, the subsequent lines are indented to indicate that they are all part of the same block. Every line that shares the same indenting will be repeated 5 times.

You can use a for loop to build up a list of elements by appending to an existing list (using its append() member function). For example, to create the first N elements of the Fibonacci sequence:


In [ ]:
fib = [1,1]
N = 100
for i in range(N-2):
    fib.append(fib[-2]+fib[-1])
print (fib)

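You may notice (if N is large enough) that the numbers get very big. In python 2, the biggest ones are printed with an L at the end: python 2 has a special type called a long integer which allows for arbitrarily large numbers. In python 3, all integers can be arbitrarily large and there is no L.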

Here is an example of what not to do with for loops (if you can help it). For loops are more computationally expensive in python than using the numpy functions to do the math. Always try to cast the problem in terms of numpy math and functions. It will make your code faster. Try making N in the following example larger and larger and you'll see the difference.


In [ ]:
# A slightly more complex example of a for loop:
import time
N = 100
x=np.arange(N)
y=np.zeros(N)

t1 = time.time()   # start time
for ii in range(x.size): 
    y[ii]=x[ii]**2
t2 = time.time()   # end time
    
print ("for loop took "+str(t2-t1)+" seconds")

Another way to implement the stopwatch is to use the iPython "magic" command %time:


In [ ]:
%time for ii in range(x.size): y[ii]=x[ii]**2

Try making N in the above code block bigger and see how the execution time goes up. Now do the exact same thing in the next code block, but use numpy functions without a loop. See how the execution time improves.
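
Here is one way the comparison might look, as a sketch rather than the definitive answer. It uses the same time.time() stopwatch as above. The array is created with dtype=float (our choice, to avoid integer overflow when squaring large values on some systems), and the names y_loop and y_vec are just for illustration:

```python
import time
import numpy as np

N = 100000
x = np.arange(N, dtype=float)

# Version 1: explicit for loop
y_loop = np.zeros(N)
t1 = time.time()
for ii in range(x.size):
    y_loop[ii] = x[ii]**2
t2 = time.time()

# Version 2: vectorized, no loop
y_vec = x**2
t3 = time.time()

print("for loop took " + str(t2 - t1) + " seconds")
print("numpy took    " + str(t3 - t2) + " seconds")
print("same answer? " + str(np.array_equal(y_loop, y_vec)))
```

The exact timings depend on your machine, but the vectorized version is typically faster by a large factor.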


In [ ]:

While loops

Similar to a for-loop, a while-loop executes the same code block repeatedly until a condition is no longer true. These are handy if you don't know ahead of time how long a loop will take, but you know you have to stop when a condition is true (or false). As an example, we can estimate the smallest positive floating point number by continually dividing by 2 until we get zero. (This is not quite the same thing as machine-$\epsilon$, which is the spacing between 1.0 and the next larger float.)


In [ ]:
Ns = [1.]
while Ns[-1] > 0:
    Ns.append(Ns[-1]/2)
print(Ns[-10:])

Beware, though, that unlike for loops, you can have a never-ending loop if the condition is never false. Your computer is happy to keep grinding away forever. How could you safeguard against this?
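
One common safeguard (a sketch, not the only answer) is to count the iterations and give up after some maximum. The limit max_steps below is a hypothetical value; pick something comfortably larger than you expect to need:

```python
max_steps = 10000   # hypothetical safety limit
Ns = [1.]
n_steps = 0

# Stop when we reach zero OR when we have tried too many times
while Ns[-1] > 0 and n_steps < max_steps:
    Ns.append(Ns[-1] / 2)
    n_steps += 1

if Ns[-1] > 0:
    print("Gave up after " + str(n_steps) + " steps")
else:
    print("Reached zero after " + str(n_steps) + " steps")
```

After the loop, checking which condition ended it tells you whether you got a real answer or just hit the limit.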

If blocks

if statements act as a way to carry out a task only if a certain condition is met. As with the for loops above, everything that is indented after the if statement is part of the block of code that is executed if the condition is met. Likewise for the else block.


In [ ]:
x=5
if x==5:
    print('Yes! x is 5')
    
# The two equal signs evaluate whether x is equal to 5.  One can also use >, >=, <, <=, != (not equal to)

Note the two different uses of the equals sign: assignment to a variable (=) and the logical comparison of two objects (==).


In [ ]:
x=5
if x==3:
    print('Yes! x is 3')

if-else statements execute the code in the if block if the condition is true, otherwise it executes the code in the else block:


In [ ]:
x=5
if x==3:
    print('Yes! x is 3')
else:
    print('x is not 3')
    print('x is '+str(x))

One can also have a series of conditions in the form of an elif block:


In [ ]:
x=5

if x==2:
    print('Yes! x is 2')
elif x==3:
    print('Yes! x is 3')
elif x==4:
    print('Yes! x is 4')
else:
    print('x is '+str(x))

You can also have multiple conditions that are evaluated:


In [ ]:
x=5

if x > 2 and x*2==10:
    print('x is 5')
    
if x > 7 or x*2 == 10:
    print('x is 5')

Exercise

Try this. Use a while loop and an if-else block to generate the Collatz sequence. Start the list with any positive integer. To get the next element of the list, check if the current element is even or odd. If even, divide it by 2. If odd, multiply by 3 and add 1. The sequence ends if you get to 1. Print out the length of the list. The Collatz conjecture states that the sequence will always converge to 1 eventually regardless of the starting integer. The proof of this conjecture is one of the great unsolved problems in mathematics.
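
If you get stuck, here is one possible solution sketch (try it yourself first!). It uses the remainder operator %, which we haven't covered: n % 2 is 0 when n is even. It also uses // so the elements stay integers. The starting value 6 is an arbitrary choice:

```python
seq = [6]   # arbitrary starting integer

while seq[-1] != 1:
    current = seq[-1]
    if current % 2 == 0:
        # even: divide by 2 (integer division keeps it an integer)
        seq.append(current // 2)
    else:
        # odd: multiply by 3 and add 1
        seq.append(3 * current + 1)

print(seq)        # [6, 3, 10, 5, 16, 8, 4, 2, 1]
print(len(seq))   # 9
```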


In [ ]:

Functions

Functions allow you to make a bundle of python statements that are executed whenever the function is called. Each function has arguments you pass in and value(s) that are returned. For example, you've been using the print function. There are also some functions above that you have been using. Now we will make our own.


In [ ]:
# the function 'myfunc' takes two numbers, x and y, adds them together and returns the results
def myfunc(x,y):
    z=x+y
    return z

# to call the function, we simply invoke the name and feed it the requisite inputs:
g=myfunc(2,3.)
print(g)

In [ ]:
# you can set input parameters to have default values
def myfunc2(x,y=5.):
    z=x+y
    return z

g=myfunc2(2.)
print(g)

g=myfunc2(2.,4.)
print(g)

Let's put the if, for and def together into one example. Take the code above that generates a Fibonacci sequence and put it into a function called Fibonacci. The function should take one argument (the length of the sequence). It should check if N is less than 2 (which can't be done), or if N is greater than 1000 (which would take a very long time). If these conditions are met, print an error statement and return None (a python special object that generally indicates something went wrong). Otherwise, compute and return the sequence.
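
Here is one way it might look, as a sketch to compare against your own attempt. The wording of the error message is our own invention:

```python
def Fibonacci(N):
    """Return the first N elements of the Fibonacci sequence as a list,
    or None if N is outside the allowed range."""
    if N < 2 or N > 1000:
        print("Error: N must be between 2 and 1000")
        return None
    fib = [1, 1]
    for i in range(N - 2):
        fib.append(fib[-2] + fib[-1])
    return fib

print(Fibonacci(10))   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```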


In [ ]:

Give your function a test run. Make sure it behaves as it should.


In [ ]:
Fibonacci(1)
Fibonacci(1001)
Fibonacci(10)

Importing Python packages: examples

One of the major advantages of python is a wealth of specialized packages for doing common scientific tasks. Sure, you could write your own least-squares fitter using what we've shown you so far, but before you attempt anything like that, take a little time to "google that" and see if a solution exists already.

You will have to import Python modules/packages to carry out many of the tasks you will need for your research. As already discussed, numpy is probably the most useful. scipy and astropy are other popular packages. Let's play around with a few of these to give you an idea of how useful they can be.

Random Numbers


In [ ]:
# I like to declare all of my imported packages at the top of my script so that I know what is available.
# Also note that there are many ways to import packages.

import numpy.random as npr                   # Random number generator
from scipy import stats                      # statistics functions
import scipy.interpolate as si               # interpolation functions
from astropy.cosmology import FlatLambdaCDM  # Cosmology in flat \Lambda-CDM universe

In [ ]:
# Random numbers are useful for many tasks.

# draw 5 random numbers from a uniform distribution between 0 and 1:
x1=npr.uniform(0, 1, size=5)
print(x1)

# draw 5 random numbers from a normal distribution with mean 10 and standard
#  deviation 0.5:
x2=npr.normal(10, 0.5, size=5)
print(x2)

# draw 10 random integers between 0 and 5(exclusive)
x3=npr.randint(0,5,10)
print(x3)

Now you try. Generate a random array of numbers drawn from a Poisson distribution with expectation value 10. Check that the mean of the array is approximately 10 and the standard deviation is approximately sqrt(10).
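
A possible sketch of this check follows. The sample size of 100000 is an arbitrary choice (bigger samples give a closer match), and the npr.seed() call is our addition so that the numbers are the same every time you run it:

```python
import numpy as np
import numpy.random as npr

npr.seed(12345)   # optional: makes the "random" numbers repeatable

sample = npr.poisson(10, size=100000)

print(np.mean(sample))   # should be close to 10
print(np.std(sample))    # should be close to the next line
print(np.sqrt(10))
```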


In [ ]:

Here is a practical example of using random numbers. Often in statistics, we have to compute the mean of some population based on a limited sample. For example, a survey may ask car drivers their age and make/model of car. A marketing team may want to know the average age of drivers of Ford Mustangs so they can target their audience. Calculating a mean is easy, but what about the uncertainty in that mean? You could compute the population standard deviation, but that pre-supposes the underlying distribution is Gaussian. Another method, which does not make any assumptions about the distribution, is bootstrapping. Build a new sample of the same size by drawing values at random from the data with replacement (so some values get repeated and others get left out). Compute the mean of this new sample. Do that many times and compute the standard deviation of these bootstrapped mean values.

Below is an example of bootstrapping a sample to determine the uncertainty on a measurement. In this case, we will compute the mean of a sample of ages, and the uncertainty on the mean.


In [ ]:
# x below represents a measurement of the ages of N people, where N=x.size
x=np.array([19.,20.,22.,19.,21.,24.,35.,22.,21.])
# This is the mean age:
print(np.mean(x))

# Now we "bootstrap" to determine the error on this measurement:
ntrials=10000 # number of times we will draw a random sample of N ages
x_arr=np.zeros(ntrials) # store the mean of each random sample in this array

for ii in range(ntrials):
    # draw N random integers, where N equals the number of samples in x
    ix=npr.randint(0,x.size,x.size)
    # subscript the original array with these random indices to get a new sample and compute the mean
    x_arr[ii]=np.mean(x[ix])

# Finally, compute the standard deviation of the array of mean values to get the uncertainty on the *mean* age
print(np.std(x_arr))
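
In the spirit of avoiding for loops, the same bootstrap can also be vectorized. This is a sketch of one way to do it: draw all the random indices at once as a 2D array with one row per trial, then take the mean along each row. The seed and the names ix and means are our own choices:

```python
import numpy as np
import numpy.random as npr

npr.seed(0)   # optional: repeatable random numbers

x = np.array([19., 20., 22., 19., 21., 24., 35., 22., 21.])
ntrials = 10000

# One row of N random indices per trial: shape (ntrials, N)
ix = npr.randint(0, x.size, size=(ntrials, x.size))

# x[ix] has the same shape as ix; average over each row to get one mean per trial
means = np.mean(x[ix], axis=1)

print(np.std(means))
```

The result should agree with the loop version to within the random scatter between runs.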

Binning Data

Another common statistical task is binning data. This is often done on noisy spectra to make "real" features more visible. In the following example, we are going to bin galaxies based on their redshift and compute the median stellar mass for each bin.


In [ ]:
# This is an example of binning data and computing a particular value for the data in each bin.
# The scipy package is used to carry out this task.

# Let's make some fake data of galaxies spanning random redshifts between 0<z<3:
z=npr.rand(10000)*3.
# And these galaxies have random stellar masses between 9<log(M/Msun)<12:
m=npr.rand(10000)*3.+9.

# Now we want to compute the median stellar mass for galaxies at 0<z<1, 1<z<2, and 2<z<3:
# So let's declare the bin edges
bins=[0.,1.,2.,3.]

m_med,xbins,btemp = stats.binned_statistic(z,m,statistic='median',bins=bins)
print(bins)
print(m_med)

Interpolation

In science, we measure discrete values of data. Sometimes you need to interpolate between two (or more) points. A common example is drawing a smooth line through the data when making a graph. You could do this by hand using numpy, but the module scipy has an entire interpolation package that offers an easy solution.
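
For the simplest case (straight lines between points), numpy alone is enough. As a small sketch, np.interp(xnew, x, y) does linear interpolation of the data (x, y) at the new position(s) xnew; the names y_half and y_many are just for illustration:

```python
import numpy as np

x = np.arange(5.)   # 0, 1, 2, 3, 4
y = x**2            # 0, 1, 4, 9, 16

# A single new point, halfway between (0,0) and (1,1)
y_half = np.interp(0.5, x, y)
print(y_half)       # 0.5

# Several new points at once
y_many = np.interp([0.5, 1.5, 3.25], x, y)
print(y_many)       # [ 0.5   2.5  10.75]
```

Notice that the true value of 0.5**2 is 0.25, not 0.5: linear interpolation of a curved function is only approximate.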


In [ ]:
# Interpolating between data points is another common task.  We'll again use scipy to do some interpolating:

x=np.arange(5.)
y=x**2

print(x)
print(y)

# Linear interpolation
f=si.interp1d(x,y,kind='linear')
# si.interp1d returns a function, f, that we can evaluate at any value within the range of x.
# For example, let's evaluate f(x)
print(f(x))
# And now a different value
print(f(0.5))

# We can employ a higher order interpolation scheme to get more precise results (assuming a smoothly varying function)
f=si.interp1d(x,y,kind='quadratic')
print(f(x))
print(f(0.5))

Astronomy-specific Example

Now we get into a really specific case. Astropy is a collection of several packages that are very useful for astronomers. It is actively developed and has new stuff all the time. Here, we show how you can use the cosmology calculator to compute the age of the Universe given a redshift.


In [ ]:
# The astropy package has all kinds of astronomy related routines.
# Here, we define a cosmology that allows us to compute things like
# the age of the universe or Hubble constant at different redshifts

cosmo = FlatLambdaCDM(H0=70., Om0=0.3)
redshift=0.
print(cosmo.age(redshift))

redshift=[0,1,2,3]
print(cosmo.age(redshift))

Running code

There are many environments in which one can run Python code:

  • iPython notebooks like this one are good for running quick snippets of code.
  • Spyder (provided with Anaconda) provides a space for writing scripts, executing them, and also for easily looking up definitions of different functions. Very similar to the IDL graphical IDE.
  • One can also write code in a plain text editor, like Emacs/Aquamacs. Then execute the code in a terminal running Python or iPython.

Running code from the command-line

This is the most common and agnostic way to run your code. If you send your code to someone else, assume they will run it from the command line. If you are running your code on an HPC cluster, it needs to be run from the command-line. Lastly, writing code that runs with minimal user-interaction makes it more repeatable.

There are two aspects of writing command-line code that you should be familiar with: 1) getting arguments from the command line; and 2) working with files. The first is done through the sys.argv variable, the second is done with the os.path package. Here we look at each briefly.

sys.argv

Quite simply, this is a list of the command-line arguments. You've seen several unix commands. Suppose we wanted to write the equivalent of the cp command, but using a python script. Usually, you run the command like this from the command-line:

cp file1 file2

That would copy file1 to file2. So our python script will need to get both the source and destination file name. Here is how I would write a simple script to do the same thing as cp:

import sys
f1 = sys.argv[1]
f2 = sys.argv[2]
print ("copying %s to %s" % (f1,f2))

# Here would be the code to actually copy one file to the other
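
One way to fill in the missing step is the shutil module from the standard library. This is only a sketch of how the complete script might look: the function name copy_file and the script name mycp.py are made up for illustration, and using shutil.copyfile is our choice, not the only way to copy a file.

```python
import shutil
import sys

def copy_file(source, destination):
    """Copy the contents of source to destination, like a bare-bones cp."""
    print("copying %s to %s" % (source, destination))
    shutil.copyfile(source, destination)

if __name__ == "__main__":
    # sys.argv[0] is the script name, so we expect 3 entries in total
    if len(sys.argv) != 3:
        print("usage: python mycp.py source destination")
    else:
        copy_file(sys.argv[1], sys.argv[2])
```

Checking len(sys.argv) first avoids a confusing IndexError when the user forgets an argument.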

Note: you can try to print out sys.argv on the next command-block. It will show you how this ipython notebook was actually run. But it is not very useful for doing anything practical.

With more complicated code, your command-line arguments may also get rather complicated (you may have optional arguments, switches, etc). There is a really good module in python for dealing with such complicated arguments so that your script isn't filled with code just to deal with parsing sys.argv. Have a look at the argparse module when you get to the point where you need to deal with complicated command-line arguments.
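
To give a flavor of argparse, here is a minimal sketch for the cp-like script. The function name parse_command_line and the -v/--verbose switch are hypothetical, invented for this example. Normally you would call parse_args() with no arguments so that it reads sys.argv; here we pass a list explicitly so the example also works inside a notebook:

```python
import argparse

def parse_command_line(argv=None):
    """Parse the arguments of a hypothetical cp-like script."""
    parser = argparse.ArgumentParser(description="Copy one file to another")
    parser.add_argument("source", help="file to copy from")
    parser.add_argument("destination", help="file to copy to")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print what is being done")
    # argv=None means "use sys.argv[1:]"
    return parser.parse_args(argv)

args = parse_command_line(["file1", "file2", "-v"])
print(args.source, args.destination, args.verbose)   # file1 file2 True
```

As a bonus, argparse automatically gives your script a -h option that prints a usage message built from the help strings.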


In [ ]:
print(sys.argv)

os.path

Up until now, we have been printing output to the screen. You can still do that with command-line scripts, but once you close down the terminal, that output is lost. Your code will need to write to output files, but likely will also have to read from files. os.path gives you some functions that are useful when dealing with files.

You may have to check to see if a file exists, if a folder exists, etc.


In [ ]:
# Check to see if a file exists
if os.path.isfile('/bin/ls'):
    print("Oh good, you can list files")

if not os.path.isfile('test_output.dat'):
    print ("It is safe to use this")
    
# Check if a folder exists
if os.path.isdir('/tmp'):
    print ("you have a tmp folder")
    
# construct the path to a file using the correct separator
of = os.path.join('tmp','some','file')
print (of)

File access with open

In order to access the contents of an existing file, or write contents into a new file, you use the open function. This will return a file object that you can use to read or write data. Here are some common tasks. First, let's write data to a file:


In [ ]:
# Open a file for writing (note the 'w')
f = open('a_test_file', 'w')
# write a header, always a good idea!
f.write("# This is a test data file. It contains 3 columns\n")
for i in range(10):
    f.write("%d %d %d\n" % (i, 2*i, 3*i))
# We need to "close" the file to make sure it is written to disk
f.close()

Now, depending what your current working directory is (see beginning of tutorial), there will be a file called a_test_file in that folder. If you like, have a look at it using an editor or file viewer. It will have 10 rows and 3 columns. Note that we needed to have a "newline" (\n) at the end of the string we used in the write() function, otherwise the output would have been one long line.

Now, let's read the file back into python. You can either read the whole thing in at once as a single string, read it in line-by-line or read all lines in at once as a list.


In [ ]:
# This time we use 'r' to indicate we only want to read the file
f = open('a_test_file', 'r')
everything = f.read()

# This brings us back to the beginning of the file
f.seek(0)
one_row = f.readline()

f.seek(0)
list_of_rows = f.readlines()
f.close()
print(list_of_rows)

Use print() to have a look at the data we read in. Note that the rows will include the newline character (\n). You can use the string's strip() method to get rid of it.
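
To turn those rows of text into numbers, you typically strip each line, skip the comments, and split on whitespace. Here is a self-contained sketch: it writes a small 3-row version of the test file into a temporary folder first (so it does not touch your own files) and uses the with statement, which we have not covered; it closes the file for you automatically at the end of the block.

```python
import os
import tempfile

# Make a small stand-in for a_test_file in a temporary folder
fname = os.path.join(tempfile.mkdtemp(), 'a_test_file')
with open(fname, 'w') as f:
    f.write("# This is a test data file. It contains 3 columns\n")
    for i in range(3):
        f.write("%d %d %d\n" % (i, 2*i, 3*i))

# Read it back, converting each row into a list of integers
rows = []
with open(fname, 'r') as f:
    for line in f:
        line = line.strip()                # remove the newline and stray spaces
        if not line or line.startswith('#'):
            continue                       # skip blank lines and the header
        rows.append([int(field) for field in line.split()])

print(rows)   # [[0, 0, 0], [1, 2, 3], [2, 4, 6]]
```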

In the next tutorial (visualization), we'll show you a better way to read in standard format data files like this. But sometimes you'll be faced with data that these more automatic functions can't handle, so it's good to know how to read it in by hand.

Debugging

You almost never write a script without introducing bugs (the term comes from when computers were mechanical machines and insects literally interfered with the running of the code). Luckily, python gives a very nice "traceback" report when it encounters a problem with what you've written. Let's just generate a mistake on purpose and see what happens.


In [ ]:
x1=np.arange(5)
x2=np.arange(3)
print(x1)
print(x2)
print(x1+x2)

This traceback is pretty short, but it shows exactly where the problem occurs (indicated with ---->). And the explanation is pretty clear (when you're used to numpy arrays). You can't add two arrays unless they have the same shape, or shapes that numpy knows how to stretch to match (a set of rules called broadcasting).
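
Here is a small sketch of which shapes work and which don't (the names row, grid, and failed are just for illustration). A 1D array of length 3 can be added to a (4,3) array because the last dimensions match, so numpy repeats it across all 4 rows. Lengths 5 and 3 can't be matched up at all. The try/except block, which we have not covered, catches the error so the code keeps running:

```python
import numpy as np

row = np.arange(3)        # shape (3,)
grid = np.ones((4, 3))    # shape (4, 3)

# This works: row is added to each of the 4 rows of grid
total = grid + row
print(total.shape)        # (4, 3)

# This does not: shapes (5,) and (3,) are incompatible
try:
    np.arange(5) + np.arange(3)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError: " + str(err))
```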

These kinds of bugs, where python stops and tells you what went wrong, are the easiest to fix. You've written something wrong (syntax error), or you've done some illegal operation (like above) and your code grinds to a halt. You fix the problem and then find another and so on until your code runs.

But then you might not get the "right" answer. Or your code does something unexpected. Or you get a "divide by zero" error and it's not completely obvious where things went wrong. That's when you need to do real debugging. There are several approaches to dealing with this:

  • Print the values of variables throughout the program by simply injecting "print statements" in your code. This is easy to do and works "anywhere". Usually, with a few of these you can see the dumb mistake you made (but python didn't catch).
  • Use a debugger. Spyder has several debugging tools. You can set breakpoints (where you want your code to stop) and examine the values of variables. But you need to load your program into Spyder and run it from there.
  • If you want to run your script in the Python interpreter and have it leave all the variables available after a run:
    • >>> exec(open("myscript.py").read(), globals())

Final exercise

(if time permits)

To wrap this all up, we're going to leave the notebook and get you to write a stand-alone script that can be run from the command-line. The script should:

  • get the name of a file from command-line argument
  • open the file and read its contents (columns of numbers)
  • convert columns in the file into numpy arrays,
  • compute and report the mean of each column.

Try to use the concepts we've covered in this tutorial (e.g. make a function that reads columns and converts to arrays). Run the script on the file we created earlier (a_test_file).

References

  • Carnegie python links:

  • Python "experts" at Carnegie

    • Shannon Patel (patel@carnegiescience.edu)
    • Chris Burns (cburns@carnegiescience.edu)
    • Eduardo Banados (#205)
  • Google. Chances are someone else had the same question you do, asked it, and had it answered on stackoverflow.