Getting Started

G. Richards, 2016

This notebook contains most everything that we need to get started. It draws heavily from classes taught by Andy Connolly (http://cadence.lsst.org/introAstroML/) and Karen Leighly (seminar.ouml.org).

Installing Git

Things will be a lot easier if you can download the file that we are currently viewing (which is a jupyter notebook; more on that later). Then you won't have to try to read what is on the screen at the front of the room; you can just read your own screen. More importantly, you can interact directly with the notebook. We will do the same as the course goes on.

So, we will use a repository tool called 'Git' that will let you download the notebooks that I put into the repository before each class. If you miss a class your homework will be to "complete" the notebook before the next class.

If you want, you can make an account on GitHub, where you can create your own repositories. But for now, you just need to have Git installed on your machine so that you can use it to 'pull' files from the class repository.

To get Git for any platform, see: https://git-scm.com/download/.

If you are using the Newton cluster machines, Git is already installed.

Once Git is installed, make a place where you are going to put the class repository. Perhaps in a git subdirectory if you think that you might use more repositories later (we might even for this class!).

[~]% cd ~
[~]% mkdir git
[~]% cd git

Now we need to "clone" the class repository into your account:

[~]% git clone https://github.com/gtrichards/PHYS_T480.git

This will make a subdirectory called "PHYS_T480" in which you will see a file called "InitialSetup.ipynb", which is the file that you are reading now!

If you already know what a jupyter (or IPython) notebook is, go ahead and open the file with jupyter. Otherwise, hang on and we'll get to that.

Before each class, you will want to update this repository so that you get any new files that I have put there for you. Do that with

[~]% cd ~/git/PHYS_T480
[~]% git status
[~]% git pull

Setting Up Your Computer

Everyone will need the proper software environment in order to participate in the course. This can be either on the Newton cluster machines or on your laptop; which one is up to you (but we don't have enough computers for everyone!).

The software requirements are as follows: Anaconda Python 2.7.X (not 3.X unless you know how to deal with the differences)

NumPy version 1.11+: efficient array operations
SciPy version 0.17+: scientific computing tools
matplotlib version 1.5+: plotting and visualization
jupyter version 1.0+: interactive computing (formerly IPython)
Scikit-learn version 0.17+: machine learning
astroML: an astronomical machine learning toolkit

Anaconda Installation

If you don’t already have Anaconda Python 2.7.X, then go to https://www.continuum.io/downloads and download the installer of your choice (e.g., the graphical OSX installer). Again, choose Python 2.7, not 3. Don't worry: it's free. But by all means, have them send you the cheat sheet if you are asked.

Open the install package and follow the instructions, installing for you only.

Open a new terminal window, and make sure your $PATH variable points to the Anaconda installation. You can do this by typing

[~]% which python

The result should show the path to the newly-installed anaconda folder. If not, you must modify your $PATH variable to point to the anaconda directory as follows:


Add Anaconda Python to the path

I use TCSH, so I added the following to my .tcshrc file:
set path=(/Users/gtr/anaconda/bin $path)

If you use BASH instead, you’ll have to add the following to your .bashrc file:

PATH="/home/newton3/gtr/anaconda2/bin:$PATH"

For good measure (since I have about 12 python installations on my computer), I usually create an alias for Anaconda python so that I know that I’m starting that particular python if I’m using the command line.

alias apython '/Users/gtr/anaconda/bin/python'


Now either close that terminal window and open a new one, or type

[~]% source .tcshrc

or [~]% source .bashrc

then

[~]% which python

Hopefully all is well now.


Update your Anaconda distribution

You can (and should) update your anaconda distribution with the following command.

[~]% conda update conda

Note that conda is the package management system that comes with anaconda. Do this once a week.

I’m currently running Python 2.7.12 via conda-4.2.7. You don’t have to be doing exactly the same, but if you run into problems that I am not having, that would be a good place to start debugging.
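
One way to compare your setup against the requirements listed earlier is to print the version of each package from within Python. The following is only a sketch: the helper name installed_versions is made up for this example, and it assumes that each package exposes a __version__ attribute (packages that do not are reported as "unknown").

```python
import importlib
import sys

# Import names for the course requirements; "sklearn" is the import name
# of Scikit-learn.
PACKAGES = ["numpy", "scipy", "matplotlib", "sklearn", "astroML"]


def installed_versions(names):
    """Return {name: version string}, with None for packages that fail to import."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Any failure to import is treated as "not usable here".
            versions[name] = None
    return versions


# Single-argument print(...) works in both Python 2.7 and Python 3.
print("Python %s at %s" % (sys.version.split()[0], sys.executable))
versions = installed_versions(PACKAGES)
for name in PACKAGES:
    print("%-12s %s" % (name, versions[name] or "NOT INSTALLED"))
```

If the path printed on the first line is not inside your anaconda directory, you are probably running a different Python than the one you just installed.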

Conda will have installed most of the other software packages listed above. The exceptions are noted below.


Installing astroML

astroML is the software library that goes together with the book. Much of it is a series of wrappers to scikit-learn, where Jake VanderPlas is one of the main contributors to each. Jake has written an Intro to AstroML paper that is worth going through to give you an idea of the sorts of things that we'll be doing in the class.

The astroML packages and add-ons can be installed using pip (the python package installer):

[~]% pip install astroML

then

[~]% pip install astroML_addons

Installing other things

We'll need a few other packages installed as we go along. Here are the ones that I remembered to write down. You can either install them now, or you can do it later (when you get an error message!).

[~]% conda install -c astropy astroquery

[~]% pip install corner

Getting Started with Python

The course requirements were set up to ensure that you have at least some (extended) experience with Python. E.g., you might have taken the 113-114-115 series. If you don't know any Python at all, this class may be pretty tough going. However, many of you might need a refresher. A good place to start is Appendix A (more specifically A3) in the textbook.

I also recommend the codecademy course on Python. It is free and does a good job of walking you through things that you need to learn.

You might also be interested in A Student’s Guide to Python for Physical Modeling: Chapters 1-3 provide a good introduction to Python before getting into the "physical modeling" part. It encourages the use of spyder, which is a matlab-like interactive interface for Python.

And, of course, you can google 'python tutorial' (or some such) and find a plethora of things that you may or may not like better.


Interfacing with Python

You can interface with Python by 1) starting it on the command line: python; 2) starting an "interactive" Python interface, IPython: ipython, which provides lots of built-in functionality through pre-defined scripts; or 3) using a web-based interactive interface to IPython, which used to be called an 'ipython notebook', but is now called a jupyter notebook.

We'll be using the latter. In fact, each lecture will be in the form of a jupyter notebook that you'll download from the github repository before the start of each class. Indeed, now that we have everything installed, you should be able to go to your PHYS_T480 directory and type jupyter notebook InitialSetup.ipynb & to open this file in your browser. (Chrome is recommended).

(Brief) Review of Python Basics

To start with, Python provides the usual numerical data types (integers, floats, and complex numbers) and the standard arithmetic operations on them.
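
As a quick refresher, here is a small sketch of the built-in numerical types and operators. The variable names are arbitrary. One thing to watch for: in Python 2.7, dividing two integers with / discards the remainder (7/2 gives 3), so the example divides by a float instead.

```python
a = 7        # int
b = 2.0      # float
c = 1 + 2j   # complex

print(a + b)               # 9.0  (int + float gives a float)
print(a // 2)              # 3    (floor division)
print(a % 2)               # 1    (remainder)
print(a ** 2)              # 49   (exponentiation)
print(a / 2.0)             # 3.5  (true division, since one operand is a float)
print(c * c.conjugate())   # (5+0j)
```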

We will encounter three main types of collections of data:

  • Lists: a mutable array of data
  • Tuples: ordered, immutable list
  • Dictionaries: keyword/value lookup

It is worth noting that python begins indexing at 0 and uses row-major order.

Tuple

  • denoted by parentheses, e.g., x=(5.0,7.0,9.0,11.0)
  • its most notable property is that it is immutable – after being defined, it cannot be changed
  • to index, use square brackets, e.g., print x[0]
  • can get part of one using, e.g., print x[2:] etc.

List

  • denoted with brackets, e.g., y=[5.0,7.0,9.0,11.0]
  • in contrast to a tuple, it is mutable: its elements can be changed after it is defined
  • to index, also use square brackets, e.g., print y[0]
  • can get part of one using, e.g., print y[2:] etc.

Dictionary

  • assigns a value to a key, for example z={'a':2,'b':4,'c':6}, where a,b,c are the keywords.
  • the dictionary is indexed by the keyword, e.g., print z['a']
  • they can be quite complicated.
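
The three collection types can be compared side by side with the same x, y, and z used above. This is just an illustrative sketch; the replacement value -1.0 and the new key 'd' are arbitrary.

```python
x = (5.0, 7.0, 9.0, 11.0)        # tuple
y = [5.0, 7.0, 9.0, 11.0]        # list
z = {'a': 2, 'b': 4, 'c': 6}     # dictionary

print(x[0])     # 5.0  (indexing starts at 0)
print(x[2:])    # (9.0, 11.0)

# A list can be modified in place ...
y[0] = -1.0
print(y)        # [-1.0, 7.0, 9.0, 11.0]

# ... but trying the same thing on a tuple raises a TypeError.
try:
    x[0] = -1.0
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("tuples are immutable: %s" % err)

# Dictionaries are indexed by key and can grow after they are defined.
z['d'] = 8
print(z['a'])       # 2
print(sorted(z))    # ['a', 'b', 'c', 'd']
```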

Here is a lot more on data structures: https://docs.python.org/2/tutorial/datastructures.html.


Methods and Attributes

Each type of data structure has associated “methods”. A method is like a little built-in function that can be run on a data structure.

For example, open a new cell (or in a python terminal window) and do the following:

v=[27.0,35.0,101.0,57.0]
print v
v.sort()
print v

So $v$ has now been replaced with the sorted $v$.
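
Because sort() works in place, it is easy to lose the original ordering by accident. A short sketch of the difference between the list method sort() and the built-in function sorted(), which returns a new list instead (the names w and result are arbitrary):

```python
v = [27.0, 35.0, 101.0, 57.0]

w = sorted(v)        # builds a new sorted list; v is left alone
print(v)             # [27.0, 35.0, 101.0, 57.0]
print(w)             # [27.0, 35.0, 57.0, 101.0]

result = v.sort()    # sorts v itself and returns None
print(result)        # None
print(v)             # [27.0, 35.0, 57.0, 101.0]
```

A common mistake is writing v = v.sort(), which replaces the list with None.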

Basic mathematical operations that can be applied to your data are found here: https://docs.python.org/2/library/math.html.

OK, hopefully that is a refresher for everyone, so let's get going with more complicated stuff. If you need more of a refresher than that, see the links above.

Open a jupyter notebook

[~]% jupyter notebook &

This will pop up a web page in your default browser and show the Dashboard, where you can navigate to a particular notebook or open a new one. You can also open a notebook directly with

[~]% jupyter notebook InitialSetup.ipynb &

If you are creating a new notebook, click on 'new' at the top-right of the page and select 'python'. (N.B. If I get Chrome Cast working, I'll ask you to edit the jupyter defaults to use Google Chrome for class.)

Working with jupyter notebooks and IPython

Notebooks have 2 primary cell types: 'Markdown' and 'Code'. The Markdown cells are basically just for you to read. The Code cells are meant to be executed (perhaps after you have filled in some blanks).

To execute a cell in the notebook, type 'shift-return'. If you see a * in the square bracket to the left of the cell or a 'Busy' in the tab title, it means the command is in the process of running, and you need to wait for it to finish.

The notebook is autosaved, so that when you return to a notebook, everything is the same as you left it. If you want to reset it, you can do “revert to checkpoint”. If you save it yourself, you will create a checkpoint, and the original version will be unavailable.

Here are some useful IPython commands to get you started (# followed by text indicates a comment and not what you type)!

In [ ]: ?          # basic help function.  Pops open a sub-frame at the bottom of page.  
                     Close with "x".

In [ ]: %quickref  # Pops open a quick reference card

In [ ]: !          # spawning out to the operating system; 
                     e.g files=!ls will fill the variable files 
                     with a list of files in your directory.


IPython also contains a number of “magic” commands. Two examples are

In [ ]: %matplotlib inline  # makes plots within the web browser instead of popping up a new window

In [ ]: %whos               # lists the vectors, etc. that you have defined.

IPython also has lots of keyboard shortcuts – the main one is shift-enter to run a cell.

NumPy

NumPy is short for Numerical Python. It is the foundational package for scientific computing in Python. It is a library that lets us work with data structures called arrays, which are more efficient for storing and manipulating numerical data than Python's built-in data structures.

For example, cut and paste this into a new cell, then type shift-enter to run it:

import numpy as np
xlist = [1,2,3]
xarray = np.array(xlist)
twice_xarray = 2*xarray
print twice_xarray

This is far better than creating a for loop over the entries in xlist or even something fancier like a "list comprehension"

twice_xlist = [2*x for x in xlist]
print twice_xlist

Note the lack of commas in the array as compared to the list.
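
Efficiency is not the only difference between arrays and lists: the arithmetic operators also mean different things for the two types. A brief sketch, reusing xlist and xarray from above (the names repeated and doubled are arbitrary):

```python
import numpy as np

xlist = [1, 2, 3]
xarray = np.array(xlist)

# For a list, * means repetition, not arithmetic.
repeated = 2 * xlist
print(repeated)      # [1, 2, 3, 1, 2, 3]

# For an array, * acts on each element.
doubled = 2 * xarray
print(doubled)       # [2 4 6]
```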

To load the NumPy library, type:

In [ ]: import numpy as np

In fact, just plan on starting every new series of code cells with this!

The basic unit for numpy is an ndarray. See the link for examples of how to define, index, slice, etc. the array.

There are quite a few functions and methods for creating and manipulating arrays. Some useful ones include:

Method                        Property
np.zeros(5,float)             yields a 5-element array of zeros of type float
a=np.empty(4)                 yields a 4-element uninitialized array (its contents are arbitrary until filled)
a.fill(5.5)                   fills that array with 5.5 for all elements
np.arange(5)                  yields an integer array of length 5 with increasing values (0 through 4)
b=np.random.normal(10,3,5)    yields a 5-element array of normally distributed numbers with mean 10 and standard deviation 3
mask=b > 9                    creates a boolean array marking which elements are greater than 9
print b[mask]                 prints the elements with values > 9
b[mask]=0                     sets the elements > 9 to zero
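
Put together, the entries in the table can be run as one cell. This is a sketch rather than a recipe: the call to np.random.seed (and the seed value 42) is an addition that only makes the random numbers repeatable, and n_above is a helper name introduced here.

```python
import numpy as np

np.random.seed(42)    # arbitrary seed, only so that reruns give the same numbers

zeros = np.zeros(5, float)
a = np.empty(4)       # uninitialized: contents are arbitrary until we fill them
a.fill(5.5)
idx = np.arange(5)

# The second argument of np.random.normal is the standard deviation.
b = np.random.normal(10, 3, 5)
mask = b > 9          # boolean array, True where the element exceeds 9
n_above = mask.sum()  # True counts as 1, so this counts the elements > 9

print(b[mask])        # only the elements > 9
b[mask] = 0           # overwrite those elements in place
print(b)
```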

Arrays can be multidimensional, e.g., c=np.random.normal(10,3,(2,4)) creates a 2 x 4 array of normally distributed numbers with mean 10 and standard deviation 3.

More indexing, attributes, and methods for multidimensional arrays:

Method           Property
d=c[0,:]         grabs the first (0th) row of c
d=c[1,:]         grabs the second (1st) row of c
d=c[:,0]         grabs the first column of c
c.dtype          data type
c.size           total number of elements
c.ndim           number of dimensions
c.shape          shape or dimensionality
c.nbytes         memory used (bytes)
c.min()          gives the minimum of c
c.max()          gives the maximum of c
c.sum()          sum of all elements
c.mean()         mean of all elements
c.std()          standard deviation of all elements
c.sum(axis=0)    sums over the 0th axis (down each column), giving one value per column; the result has one fewer dimension than c
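
The axis argument is easier to follow with numbers that you can add up in your head. In this sketch, c is filled with the integers 0 through 7 instead of random values (an assumption made purely for readability):

```python
import numpy as np

c = np.arange(8).reshape(2, 4)
print(c)
# [[0 1 2 3]
#  [4 5 6 7]]

print(c[0, :])          # [0 1 2 3]     first row
print(c[:, 0])          # [0 4]         first column
print(c.shape)          # (2, 4)
print(c.sum())          # 28            all elements
print(c.sum(axis=0))    # [ 4  6  8 10] one sum per column
print(c.sum(axis=1))    # [ 6 22]       one sum per row

# Row-major order: flattening walks along each row in turn.
print(c.ravel())        # [0 1 2 3 4 5 6 7]
```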

You can also operate with arrays, for example, adding them together, multiplying them, multiplying or adding a constant. There are, however, "broadcasting" rules so that you need to make sure you know what you are doing when dealing with arrays of different sizes.
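
A minimal sketch of what broadcasting does, and of what happens when the shapes do not line up. The arrays row, col, and bad are made up for this example.

```python
import numpy as np

c = np.arange(8).reshape(2, 4)        # shape (2, 4)
row = np.array([10, 20, 30, 40])      # shape (4,)
col = np.array([[100], [200]])        # shape (2, 1)

# row is added to every row of c; col is added to every column of c.
plus_row = c + row
plus_col = c + col
print(plus_row)
# [[10 21 32 43]
#  [14 25 36 47]]
print(plus_col)
# [[100 101 102 103]
#  [204 205 206 207]]

# Shapes (2, 4) and (3,) are incompatible, so NumPy raises a ValueError.
bad = np.array([1, 2, 3])
try:
    c + bad
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast: %s" % err)
```

The rule of thumb is that shapes are compared from the last dimension backwards, and each pair of dimensions must either match or have one of them equal to 1.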

SciPy

SciPy is an "open-source software for mathematics, science, and engineering". We 'import' it the same way that we import numpy:

In [ ]: import scipy


SciPy is a suite of tools for data analysis, including integration, statistical functions, and interpolation. It is built on top of NumPy. Where NumPy is intended for array manipulation, SciPy is intended for analysis. This is where many useful tools will be. It may be worth looking at the user's guide to get an idea of the kinds of functions that are available. N.B. For some packages you need to import more than just the main scipy package. So, for example, to see what is available in the scipy.integrate package, do the following:

import scipy.integrate
scipy.integrate?

You can also make use of TAB completion to see what is available

scipy.integrate.[TAB]

The catch is that, since we are using notebooks, you need to have imported FIRST, because otherwise the computer doesn't know what you are asking about yet.
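
As an example of what the subpackage offers, scipy.integrate.quad numerically integrates a function of one variable between two limits. The integrand here (sin(x) from 0 to pi, whose exact value is 2) is chosen only because the answer is easy to check.

```python
import numpy as np
import scipy.integrate

# quad returns the value of the integral and an estimate of the absolute error.
value, abs_err = scipy.integrate.quad(np.sin, 0, np.pi)
print(value)      # very close to 2
print(abs_err)    # a small number
```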

Matplotlib

It is generally useful to be able to visualize your data. We will do that using the matplotlib library. Most of you should be familiar with it already, but we will likely be making some plots that are more complex than you are used to. One nice resource is this thumbnail gallery which you can use to figure out how to make a new plot.

Let's make sure that everything is working by making a simple plot


In [2]:
# magic command to make the figure pop up below instead of in a new window
%matplotlib inline 

# invoke pyplot in matplotlib, give it an alias
import matplotlib.pyplot as plt  
import numpy as np

x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
y2 = np.sin(x**2)
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();



You can change both the marker/line styles and colors. I highly recommend the use of the colors in the palettable library. You can see the colors at http://colorbrewer2.org/.

Try making some changes to the code above and see what happens.
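
For instance, line styles, markers, and colors can all be set with keyword arguments to plot. In this sketch the two hex codes are typed in by hand and are meant to be the first two colors of the ColorBrewer "Dark2" scheme; the palettable library is not used here. Run %matplotlib inline first if you have not already done so in your notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2*np.pi, 30)

fig, ax = plt.subplots()
# dashed line with circular markers
ax.plot(x, np.sin(x), color='#1b9e77', linestyle='--', marker='o',
        label=r'$\sin(x)$')
# thicker dotted line, no markers
ax.plot(x, np.cos(x), color='#d95f02', linestyle=':', linewidth=2,
        label=r'$\cos(x)$')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
```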

Scikit-learn

The Scikit-learn library forms the core of computing tools that we will use for this class. The "scikit"s are add-ons to scipy. Scikit-learn is the add-on for machine learning.

It probably needs its own introductory tutorial as both the input and output may not be quite what you would have expected. You might want to spend some time going through their quick start guide, user's guide and tutorials here: http://scikit-learn.org/stable/documentation.html.
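
To give a flavor of those conventions, here is a minimal sketch that fits a straight line with LinearRegression. The data are made up (a noise-free line y = 2x + 1). The main surprise for newcomers is that the input must be a 2-D array of shape (number of samples, number of features), even when there is only one feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0          # made-up, noise-free data

# Scikit-learn wants X with shape (n_samples, n_features), here (50, 1).
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)            # learned parameters are stored on the model object

print(model.coef_)         # array with one slope per feature
print(model.intercept_)

# predict also takes a 2-D array and returns a 1-D array of predictions.
y_new = model.predict(np.array([[11.0], [12.0]]))
print(y_new)
```

Most other Scikit-learn estimators follow the same pattern: create the object, call fit, then call predict (or transform).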

Ironically, Scikit-learn is NOT really intended for Big Data (despite the title of the course). We will be using it to learn the basics of machine learning and big data analysis. However, we'll see that if you want to do any real Big Data analysis, you'll need other tools.

Intro to Everything

Lastly, here is a tutorial on the whole scientific Python "ecosystem": www.scipy-lectures.org