G. Richards, 2016
This notebook contains most everything that we need to get started. It draws heavily from classes taught by Andy Connolly (http://cadence.lsst.org/introAstroML/) and Karen Leighly (seminar.ouml.org).
Things will be a lot easier if you can download the file that we are currently viewing (which is a jupyter notebook---more on that later). Then you won't have to try to read what is on the screen at the front of the room---you can just read your own screen. More importantly, you can interact directly with the notebook. We will do the same as the course goes on.
So, we will use a repository tool called 'Git' that will let you download the notebooks that I put into the repository before each class. If you miss a class your homework will be to "complete" the notebook before the next class.
If you want, you can make an account on GitHub, where you can create your own repositories. But for now, you just need to have Git installed on your machine so that you can use it to 'pull' files from the class repository.
To get Git for any platform, see: https://git-scm.com/download/.
If you are using the Newton cluster machines, Git is already installed.
Once Git is installed, make a place where you are going to put the class repository. Perhaps in a git subdirectory if you think that you might use more repositories later (we might even for this class!).
[~]% cd ~
[~]% mkdir git
[~]% cd git
Now we need to "clone" the class repository into your account:
[~]% git clone https://github.com/gtrichards/PHYS_T480.git
This will make a subdirectory called "PHYS_T480" in which you will see a file called "InitialSetup.ipynb", which is the file that you are reading now!
If you already know what a jupyter (or IPython) notebook is, go ahead and open the file with jupyter. Otherwise, hang on and we'll get to that.
Before each class, you will want to update this repository so that you get any new files that I have put there for you. Do that with
[~]% cd ~/git/PHYS_T480
[~]% git status
[~]% git pull
Everyone will need to have the proper software environment in order to be able to participate in the course. This can be either on the Newton cluster machines or on your laptop; which one is up to you (but we don't have enough computers for everyone!).
The software requirements are as follows: Anaconda Python 2.7.X (not 3.X unless you know how to deal with the differences)
NumPy version 1.11+: efficient array operations
SciPy version 0.17+: scientific computing tools
matplotlib version 1.5+: plotting and visualization
jupyter version 1.0+: interactive computing (formerly IPython)
Scikit-learn version 0.17+: machine learning
astroML: an astronomical machine learning toolkit
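One way to check which of these packages can be imported, and at what version, is sketched below. The helper name report_versions is made up for this illustration, and the sketch assumes that each package exposes a __version__ attribute (it reports "unknown" if not). The names in the list are the import names, so Scikit-learn appears as sklearn.

```python
import importlib

def report_versions(names):
    """Return {module name: version string, or None if it cannot be imported}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is not installed
        else:
            # assumption: the package stores its version in __version__
            versions[name] = getattr(module, "__version__", "unknown")
    return versions

# import names for the packages listed above (sklearn is Scikit-learn)
packages = ["numpy", "scipy", "matplotlib", "jupyter", "sklearn", "astroML"]
versions = report_versions(packages)
for name in packages:
    print("%-12s %s" % (name, versions[name] or "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED needs to be installed before the first class.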
If you don’t already have Anaconda Python 2.7.X, then go to https://www.continuum.io/downloads and download the installer of your choice (e.g., the graphical OSX installer; again, for Python 2.7, not 3). Don't worry--it's free. But by all means, have them send you the cheat sheet if you are asked.
Open the install package and follow the instructions, installing for you only.
Open a new terminal window, and make sure your $PATH variable points to the Anaconda installation. You can do this by typing
[~]% which python
The result should show the path to the newly-installed anaconda folder. If not, you must modify your $PATH variable to point to the anaconda directory as follows:
I use TCSH, so I added the following to my .tcshrc file:
set path=(/Users/gtr/anaconda/bin $path)
If you use BASH instead, you’ll have to add the following to your .bashrc file:
PATH="/home/newton3/gtr/anaconda2/bin:$PATH"
For good measure (since I have about 12 python installations on my computer), I usually create an alias for Anaconda python so that I know that I’m starting that particular python if I’m using the command line.
alias apython '/Users/gtr/anaconda/bin/python'
Now either close that terminal window and open a new one, or type
[~]% source .tcshrc
or [~]% source .bashrc
then
[~]% which python
Hopefully all is well now.
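The same check can be made from inside Python itself, which is useful later on in a notebook, where the shell's answer to which python may not match the interpreter that is actually running your cells. A minimal sketch using only the standard library:

```python
import sys

# full path of the interpreter that is running this code;
# for an Anaconda install it should sit inside the anaconda directory
print(sys.executable)

# version of that interpreter as a (major, minor, micro) tuple
python_version = tuple(sys.version_info[:3])
print(python_version)
```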
You can (and should) update your anaconda distribution with the following command.
[~]% conda update conda
Note that conda is the package management system that comes with anaconda. Do this once a week.
I’m currently running Python 2.7.12 via conda-4.2.7. You don’t have to be doing exactly the same, but if you run into problems that I am not having, that would be a good place to start debugging.
Conda will have installed most of the other software packages listed above. The exceptions are noted below.
astroML is the software library that goes together with the book. Much of it is a series of wrappers to scikit-learn, where Jake VanderPlas is one of the main contributors to each. Jake has written an Intro to AstroML paper that is worth going through to give you an idea of the sorts of things that we'll be doing in the class.
The astroML packages and add-ons can be installed using pip (the python package installer):
[~]% pip install astroML
then
[~]% pip install astroML_addons
The course requirements were set up to ensure that you have at least some (extended) experience with Python. E.g., you might have taken the 113-114-115 series. If you don't know any Python at all, this class may be pretty tough going. However, many of you might need a refresher. A good place to start is Appendix A (more specifically A3) in the textbook.
I also recommend the codecademy course on Python. It is free and does a good job of walking you through things that you need to learn.
You might also be interested in A Student’s Guide to Python for Physical Modeling: Chapters 1-3 provide a good introduction to Python before getting into the "physical modeling" part. It encourages the use of spyder, which is a matlab-like interactive interface for Python.
And, of course, you can google 'python tutorial' (or some such) and find a plethora of things that you may or may not like better.
You can interface with Python by 1) starting it on the command line: python; 2) starting an "interactive" Python interface, IPython: ipython, which provides lots of built-in functionality through pre-defined scripts; or 3) using a web-based interactive interface to IPython, which used to be called an 'ipython notebook' but is now called jupyter notebook.
We'll be using the last of these. In fact, each lecture will be in the form of a jupyter notebook that you'll download from the github repository before the start of each class. Indeed, now that we have everything installed, you should be able to go to your PHYS_T480 directory and type jupyter notebook InitialSetup.ipynb & to open this file in your browser (Chrome is recommended).
To start with, Python uses the following numerical data types and operations
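As a minimal sketch (the variable names are only for illustration), the built-in numerical types and the usual arithmetic operators look like this. Dividing two integers with / gives different answers in Python 2 and Python 3, so the example spells out floor division with // and uses a float operand where true division is wanted.

```python
# built-in numerical types
i = 7            # int
f = 2.5          # float
z = 1 + 2j       # complex
b = True         # bool (behaves like the integer 1 in arithmetic)

total = i + f          # 9.5  (the int is promoted to a float)
product = i * f        # 17.5
power = i ** 2         # 49
floor_div = i // 2     # 3    (floor division discards the remainder)
remainder = i % 2      # 1
true_div = i / 2.0     # 3.5  (a float operand forces true division)
magnitude = abs(z)     # modulus of the complex number, sqrt(5)
count = i + b          # 8    (True counts as 1)

for value in (total, product, power, floor_div, remainder, true_div, magnitude, count):
    print(value)
```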
We will encounter three main types of collections of data:
Tuple
List
Dictionary
It is worth noting that Python begins indexing at 0 and uses row-major order.
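The three collection types can be sketched as follows. All names and values here are made up for illustration.

```python
# tuple: ordered and immutable (cannot be changed after creation)
point = (3.0, 4.0)

# list: ordered and mutable
mags = [18.2, 19.5, 17.1]
mags.append(20.3)            # lists can grow in place

# dictionary: looks up values by key rather than by position
obj = {"name": "object_1", "mag": 18.2}
obj["z"] = 1.5               # add a new key

# indexing starts at 0, so index 0 is the first element
first_x = point[0]
first_mag = mags[0]
last_mag = mags[-1]          # negative indices count from the end

print(first_x)
print(first_mag)
print(last_mag)
print(obj["name"])

# tuples refuse item assignment
try:
    point[0] = 1.0
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)
```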
Here is a lot more on data structures: https://docs.python.org/2/tutorial/datastructures.html.
Each type of data structure has associated “methods”. A method is like a little built-in function that can be run on a data structure.
For example, open a new cell (or in a python terminal window) and do the following:
v=[27.0,35.0,101.0,57.0]
print v
v.sort()
print v
So $v$ has now been replaced with the sorted $v$.
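If you want to keep the original order as well, the built-in function sorted() returns a new list instead of changing the old one. A short sketch of the difference (print is called with parentheses so that it runs under Python 2.7 or 3):

```python
v = [27.0, 35.0, 101.0, 57.0]

# sorted() is a built-in function: it returns a new list and leaves v alone
w = sorted(v)
unchanged = list(v)      # copy of v taken before sorting in place

# .sort() is a list method: it reorders v in place and returns None
result = v.sort()

print(unchanged)
print(w)
print(v)
print(result)
```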
Basic mathematical operations that can be applied to your data are found here: https://docs.python.org/2/library/math.html.
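Note that the functions in the math module act on one number at a time, so applying them to a list means going element by element. A minimal sketch, reusing the list from above:

```python
import math

v = [27.0, 35.0, 101.0, 57.0]

# math functions take a single number...
root = math.sqrt(v[0])

# ...so apply them to a list one element at a time
roots = [math.sqrt(x) for x in v]
logs = [math.log10(x) for x in v]

print(root)
print(roots)
print(logs)
```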
OK, hopefully that is a refresher for everyone, so let's get going with more complicated stuff. If you need more of a refresher than that, see the links above.
[~]% jupyter notebook &
This command will pop up a web page in your default browser and show the Dashboard where you can navigate to a particular notebook or open a new one. You can also open a notebook directly with
[~]% jupyter notebook InitialSetup.ipynb &
If you are creating a new notebook, click on 'new' at the top-right of the page and select 'python'. (N.B. If I get Chromecast working, I'll ask you to edit the jupyter defaults to use Google Chrome for class.)
Notebooks have 2 primary cell types: 'Markdown' and 'Code'. The Markdown cells are basically just for you to read. The Code cells are meant to be executed (perhaps after you have filled in some blanks).
To execute a cell in the notebook, type 'shift-return'. If you see a * in the square bracket to the left of the cell or a 'Busy' in the tab title, it means the command is in the process of running, and you need to wait for it to finish.
The notebook is autosaved, so that when you return to a notebook, everything is the same as you left it. If you want to reset it, you can do “revert to checkpoint”. If you save it yourself, you will create a checkpoint, and the original version will be unavailable.
Here are some useful IPython commands to get you started (# followed by text indicates a comment and not what you type)!
In [ ]: ? # basic help function. Pops open a sub-frame at the bottom of page. Close with "x".
In [ ]: %quickref # Pops open a quick reference card
In [ ]: ! # spawning out to the operating system; e.g., files=!ls will fill the variable files with a list of files in your directory.
IPython also contains a number of “magic” commands. Two examples are
In [ ]: %matplotlib inline # makes plots within the web browser instead of popping up a new window
In [ ]: %whos # lists the vectors, etc. that you have defined.
IPython also has lots of keyboard shortcuts – the main one is shift-enter to run a cell.
NumPy is short for Numerical Python. It is the foundational package for scientific computing in Python. It is a library which will allow us to work with data structures called arrays that are more efficient for storing and manipulating data than other Python data structures (or C++ for that matter).
For example, cut and paste this into a new cell, then type shift-enter to run it:
import numpy as np
xlist = [1,2,3]
xarray = np.array(xlist)
twice_xarray = 2*xarray
print twice_xarray
This is far better than creating a for loop over the entries in xlist or even something fancier like a "list comprehension"
twice_xlist = [2*x for x in xlist]
print twice_xlist
Note the lack of commas in the array as compared to the list.
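To see the difference in speed for yourself, you can time both approaches with the standard timeit module. This is only a rough sketch: the array length and the number of repetitions are arbitrary choices, and the actual timings depend on your machine.

```python
import timeit
import numpy as np

n = 100000
xlist = list(range(n))
xarray = np.arange(n)

# time 20 repetitions of each way of doubling every element
t_list = timeit.timeit(lambda: [2 * x for x in xlist], number=20)
t_array = timeit.timeit(lambda: 2 * xarray, number=20)

print("list comprehension: %.4f s" % t_list)
print("numpy array:        %.4f s" % t_array)

# both approaches give the same numbers
same = (2 * xarray).tolist() == [2 * x for x in xlist]
print(same)
```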
To load the Numpy library type:
In [ ]: import numpy as np
In fact, just plan on starting every new series of code cells with this!
The basic unit for numpy is an ndarray. See the link for examples of how to define, index, slice, etc. the array.
There are quite a few methods associated with arrays. Some useful ones include:
Method | Property |
---|---|
np.zeros(5,float) | yields a 5-element array of zeros of type float |
a=np.empty(4) | yields a 4-element uninitialized array (its contents are arbitrary until filled) |
a.fill(5.5) | fills that array with 5.5 for all elements |
np.arange(5) | yields an integer array of length 5 with increasing values |
b=np.random.normal(10,3,5) | yields a 5-element array of normally distributed numbers with mean 10 and standard deviation 3 |
mask=b > 9 | creates a boolean array determining which numbers are greater than 9 |
print b[mask] | prints the ones with values > 9 |
b[mask]=0 | sets the ones > 9 to zero |
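The masking steps in the last three rows can be sketched as follows. The example uses fixed, made-up values in place of random draws so that the result is reproducible.

```python
import numpy as np

# fixed values instead of random draws so the result is reproducible
b = np.array([8.0, 12.5, 9.0, 10.1, 6.3])

mask = b > 9              # boolean array, True where b exceeds 9
selected = b[mask]        # a copy holding only the elements where mask is True
n_selected = mask.sum()   # True counts as 1, so this counts the matches

b[mask] = 0               # overwrite those same elements in place

print(mask)
print(selected)
print(n_selected)
print(b)
```

Because selected is a copy, it still holds the original values after b has been overwritten.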
Arrays can be multidimensional, e.g., c=np.random.normal(10,3,(2,4)) creates a 2 x 4 array with normally distributed numbers with mean 10 and standard deviation 3.
More methods for multidimensional arrays:
Method | Property |
---|---|
d=c[0,:] | grabs the first (0th) row of c. |
d=c[1,:] | grabs the second (1st) row of c. |
d=c[:,0] | grabs the first column of c. |
c.dtype | data type |
c.size | total number of elements |
c.ndim | number of dimensions |
c.shape | shape or dimensionality |
c.nbytes | memory used (bytes) |
c.min() | gives the minimum of c |
c.max() | gives the maximum of c |
c.sum() | sum of all elements |
c.mean() | mean of all elements |
c.std() | standard deviation of all elements |
c.sum(axis=0) | sums along the 0th axis (down each column), so the result has one fewer dimension than c |
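A few of these are easier to follow on an array with known contents. The sketch below swaps the random array for a small 2 x 4 array of consecutive integers, so that the sums can be checked by eye.

```python
import numpy as np

# a small 2 x 4 array with known values: rows are [0 1 2 3] and [4 5 6 7]
c = np.arange(8).reshape(2, 4)

first_row = c[0, :]        # shape (4,)
first_col = c[:, 0]        # shape (2,)

total = c.sum()            # sum of all 8 elements
col_sums = c.sum(axis=0)   # collapse axis 0 (the rows): one sum per column
row_sums = c.sum(axis=1)   # collapse axis 1 (the columns): one sum per row

print(c.shape)
print(first_row)
print(first_col)
print(total)
print(col_sums)
print(row_sums)
```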
You can also operate with arrays, for example, adding them together, multiplying them, multiplying or adding a constant. There are, however, "broadcasting" rules so that you need to make sure you know what you are doing when dealing with arrays of different sizes.
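A small sketch of what the broadcasting rules mean in practice (the array names and values are made up for illustration). NumPy compares shapes starting from the last axis, and two axes are compatible if they are equal or if one of them has length 1.

```python
import numpy as np

c = np.ones((2, 4))                           # shape (2, 4)
per_column = np.array([0., 10., 20., 30.])    # shape (4,)
per_row = np.array([100., 200.])              # shape (2,)

# (2, 4) and (4,) agree on the last axis, so this works
shifted_cols = c + per_column

# (2, 4) and (2,) disagree on the last axis, so this raises ValueError
try:
    c + per_row
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# adding a length-1 axis gives shape (2, 1), which stretches across the columns
shifted_rows = c + per_row[:, np.newaxis]

print(shifted_cols)
print(broadcast_failed)
print(shifted_rows)
```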
SciPy is an "open-source software for mathematics, science, and engineering". We 'import' it the same way that we import numpy:
In [ ]: import scipy
SciPy is a suite of tools for data analysis including integration, statistical functions, and interpolation. It is built on top of Numpy. Where Numpy is intended for array manipulation, Scipy is intended for analysis. This is where many useful tools will be. It may be worth looking at the user's guide to get an idea of the kinds of functions that are available. N.B. For some packages you need to import more than just the main scipy package. So, for example, to see what is available in the scipy.integrate package, do the following:
import scipy.integrate
scipy.integrate?
You can also make use of TAB completion to see what is available
scipy.integrate.[TAB]
The catch is that, since we are using notebooks, you need to have imported FIRST, because otherwise the computer doesn't know what you are asking about yet.
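As a quick test that the import worked, here is a minimal sketch that uses scipy.integrate.quad to integrate sin(x) from 0 to pi, for which the exact answer is 2. The choice of integrand is only for illustration.

```python
import numpy as np
import scipy.integrate

# quad returns the value of the integral and an estimate of its absolute error
value, abs_error = scipy.integrate.quad(np.sin, 0, np.pi)

print(value)
print(abs_error)
```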
It is generally useful to be able to visualize your data. We will do that using the matplotlib library. Most of you should be familiar with it already, but we will likely be making some plots that are more complex than you are used to. One nice resource is this thumbnail gallery which you can use to figure out how to make a new plot.
Let's make sure that everything is working by making a simple plot
# magic command to make the figure pop up below instead of in a new window
%matplotlib inline
# invoke pyplot in matplotlib, give it an alias
import matplotlib.pyplot as plt
import numpy as np
# sample 300 points on [0, 2*pi] and evaluate two functions
x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
y2 = np.sin(x**2)
# plot each curve once, with a label for the legend
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();
You can change both the marker/line styles and colors. I highly recommend the use of the colors in the palettable library. You can see the colors at http://colorbrewer2.org/.
Try making some changes to the code above and see what happens.
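As one possible change, the sketch below sets the color, line style and marker explicitly through keyword arguments to plot. The particular colors and styles are arbitrary choices for illustration; in a notebook, run %matplotlib inline first as above.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2*np.pi, 30)

fig, ax = plt.subplots()
# dashed red line, no markers
ax.plot(x, np.sin(x), color='red', linestyle='--', label=r'$\sin(x)$')
# blue circles with no connecting line
ax.plot(x, np.cos(x), color='blue', marker='o', linestyle='none', label=r'$\cos(x)$')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend();
```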
The Scikit-learn library forms the core of computing tools that we will use for this class. The "scikit"s are add-ons to scipy. Scikit-learn is the add-on for machine learning.
It probably needs its own introductory tutorial as both the input and output may not be quite what you would have expected. You might want to spend some time going through their quick start guide, user's guide and tutorials here: http://scikit-learn.org/stable/documentation.html.
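To give a flavor of what to expect, here is a minimal sketch of the fit/predict pattern that Scikit-learn estimators follow, using LinearRegression on made-up data that lie exactly on a straight line. The main surprise for newcomers is usually that the input must be a 2-D array of shape (number of samples, number of features), even when there is only one feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# scikit-learn wants X with shape (n_samples, n_features), so a 1-D array
# of 4 measurements must be turned into a 4 x 1 column
X = x[:, np.newaxis]

model = LinearRegression()
model.fit(X, y)                  # learn the slope and intercept from the data

slope = model.coef_[0]
intercept = model.intercept_
y_new = model.predict(np.array([[4.0]]))   # the input must again be 2-D

print(slope)
print(intercept)
print(y_new)
```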
Ironically, Scikit-learn is NOT really intended for Big Data (despite the title of the course). We will be using it to learn the basics of machine learning and big data analysis. However, we'll see that if you want to do any real Big Data analysis, you'll need other tools.
Lastly, here is a tutorial on the whole scientific Python "ecosystem": www.scipy-lectures.org