Orienting Yourself

Image: @jakevdp

How to install packages using conda

If you're using anaconda, you probably already have most (if not all) of these installed. If you installed miniconda:

conda install numpy

Conda also has channels which allows anybody to distribute their own conda packages. There is an "astropy" channel for AstroPy affiliated packages. You can do:

conda install -c astropy astroml

To check if a package is available on conda:

conda search numpy

How to install packages using pip

Many smaller packages are not available via the conda package manager. For these, use pip:

pip install --no-deps corner

Why prefer conda?

conda is an actual package manager that will take care of resolving dependencies optimally.

NumPy



In [ ]:

    
from __future__ import print_function

import math
import numpy as np

If you use Python for any amount of time, you'll quickly find that there are some things it is not so good at. In particular, performing repeated operations via loops is one of its weaknesses.

For example, in pure Python:



In [ ]:

    
def add_one(x):
    return [xi + 1 for xi in x]



In [ ]:

    
x = list(range(1000000))
%timeit add_one(x)

Using numpy we would do:



In [ ]:

    
x = np.arange(1000000)
%timeit np.add(x, 1)

Why is pure Python so slow?

Image: @jakevdp

Operations in NumPy are faster than Python functions involving loops, because

The data type can be checked just once
The looping then happens in compiled code

Using NumPy efficiently

The name of the game is moving all array-oriented code into vectorized NumPy operations.



In [ ]:

    
# Point coordinates
x = np.random.rand(100000)
y = np.random.rand(100000)



In [ ]:

    
# calculate distance from origin
%%timeit
dist = np.empty(len(x))
for i in range(len(x)):
    dist[i] = math.sqrt(x[i]**2 + y[i]**2)



In [ ]:

    
%%timeit
dist = np.sqrt(x**2 + y**2)

Aside: How many arrays are created in the above cell?

Sometimes you have to get a little creative to "vectorize" things:



In [ ]:

    
x = np.arange(10)**2
x



In [ ]:

    
# difference between adjacent elements
x[1:] - x[:-1]



In [ ]:

    
# by the way, this is basically the implementation of `np.ediff1d`
np.ediff1d??

Some interesting properties of numpy functions

Functions that operate element-wise on arrays are known as universal functions ("UFuncs").

UFuncs have some methods built-in, which allow for some very interesting, flexible, and fast operations:



In [ ]:

    
x = np.arange(5)
y = np.arange(1, 6)
x + y

All operators (like +) actuall call an underlying numpy function: in this case np.add:



In [ ]:

    
np.add(x, y)

These ufuncs have some interesting and useful properties:



In [ ]:

    
np.add.accumulate(x)



In [ ]:

    
np.multiply.accumulate(x)



In [ ]:

    
np.multiply.accumulate(y)



In [ ]:

    
np.add.identity



In [ ]:

    
np.multiply.identity



In [ ]:

    
np.add.outer(x, x)



In [ ]:

    
np.multiply.outer(x, x)

numpy aggregates

Aggregates are functions that take an array and return a smaller-dimension array.



In [ ]:

    
z = np.arange(10, dtype=np.float64).reshape((2, 5))
z



In [ ]:

    
np.sum(z)



In [ ]:

    
# alternate spelling:
z.sum()



In [ ]:

    
np.mean(z)



In [ ]:

    
np.min(z), np.max(z)



In [ ]:

    
# could also use ufunc 
np.add.reduce(z, axis=0)



In [ ]:

    
np.add.reduce(x)



In [ ]:

    
# equivalent to sum:
np.sum(z, axis=0)

Indexing

Slice indexing



In [ ]:

    
x = np.arange(15)
x



In [ ]:

    
x[0:5]



In [ ]:

    
x[0:10:2]  # with a stride



In [ ]:

    
x[10:0:-2]  # reversed

This sort of indexing does not make a copy:



In [ ]:

    
y = x[0:10:2]
y[0] = 100.  # modify y
y



In [ ]:

    
# x is modified:
x

Indexing with indicies



In [ ]:

    
x = np.arange(15)
y = x[[1, 2, 4]]
y



In [ ]:

    
y[0] = 100
y



In [ ]:

    
# x is not modified
x

Indexing with booleans



In [ ]:

    
x = np.arange(5)
x



In [ ]:

    
mask = np.array([True, True, False, True, False])
x[mask]



In [ ]:

    
# creates a copy
y = x[mask]
y[0] = 100.
print(y)
print(x)

How do you remember which type of indexing creates a copy?

NumPy arrays are stored as a chunk of data and a set of strides in each dimension. Boolean and arbitrary indicies cannot be represented this way, so numpy must make a copy.

More on masking

All indexing operations also work in assigning to an array. Here we demonstrate assignment with booleans.

For example, imagine you have an array of data where negative values indicate some kind of error.



In [ ]:

    
x = np.array([1, 2, 3, -999, 2, 4, -999])

How might you clean this array, setting all negative values to, say, zero?



In [ ]:

    
for i in range(len(x)):
    if x[i] < 0:
        x[i] = 0
x



In [ ]:

    
x = np.array([1, 2, 3, -999, 2, 4, -999])

mask = (x < 0)
mask

And the mask can be used directly to set the value you desire:



In [ ]:

    
x[mask] = 0
x

Often you'll see this done in a separate step:



In [ ]:

    
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[x < 0] = 0
x



In [ ]:

    
# additional boolean operations: invert
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[~(x < 0)] = 0
x



In [ ]:

    
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[(x < 0) | (x > 3)] = 0
x

Broadcasting



In [ ]:

    
x = np.arange(4)
x



In [ ]:

    
x + 3



In [ ]:

    
x = np.array([[0,  0,  0],
              [10, 10, 10],
              [20, 20, 20],
              [30, 30, 30]])
y = np.array([0, 1, 2])
print("x shape:", x.shape)
print("y shape:   ", y.shape)



In [ ]:

    
# If x and y are different dimensions, shape is padded on left with 1s
# before broadcasting.
x + y



In [ ]:

    
x = np.array([[0],
              [10],
              [20],
              [30]])
y = np.array([0, 1, 2])
print("x shape:", x.shape)
print("y shape:   ", y.shape)



In [ ]:

    
x + y

Broadcasting rules:

If the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (left) side.
If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

Mini exercises

Assume you have $N$ points in $D$ dimensions, represented by an array of shape (N, D), where there are some mising values scattered throughout the points.

Count the number of points (rows) with no missing values, using np.any or np.all.
Clean the array of the points with missing values.
Construct the matrix M, the centered and normalized version of the X array: $$ M_{ij} = (X_{ij} - \mu_j) / \sigma_j $$ using np.mean and np.std. This is one version of whitening the array.



In [ ]:

    
np.random.seed(0)
X = np.random.rand(5000)
X[np.random.randint(0, 5000, size=500)] = np.nan  # 10% missing
X = X.reshape((1000, 5))  # 1000 points in 5 dimensions



In [ ]:

    
# 1. Compute the number of points (rows) with no missing values, using `np.any` or `np.all`.



In [ ]:

    
# 2. Clean the array, leaving only rows with no missing values



In [ ]:

    
# 3. Compute the whitened version of the array using np.mean and np.std.

What else is in NumPy?

numpy.random: Random number generation
numpy.linalg: Some linear algebra routines
numpy.fft: Fast Fourier Transform



In [ ]:

    
print(np.random.__doc__)

SciPy

Interestingly, scipy predates numpy by more than half a decade (circa 1999), even though it is built on top of numpy.

Originally "scipack", a collection of wrappers for Fortran NetLib libraries.



In [ ]:

    
# contents of scipy:
import scipy
print(scipy.__doc__)

Note the overlap:

numpy.fft / scipy.fft
numpy.linalg / scipy.linalg

Why the duplication? The scipy routines are based on Fortran libraries, whereas numpy is C-only.

AstroPy

Project started in 2011, in response to increasing duplication in Python astronomy ecosystem.

Initially brought together several existing Python packages:

astropy.io.fits (formerly pyfits)
astropy.io.ascii (formerly asciitable)
astropy.wcs (formerly pywcs)
astropy.cosmology (formerly cosmolopy)

Now also contains:

astropy.table (Table class and table operations)
astropy.units (Quantity: arrays with units)
astropy.coordinates (astronomical coordinate systems)
astropy.time (UTC, UT, MJD, etc)
astropy.stats (additional stats not in scipy)
astropy.modeling (simple model fitting)
astropy.vo (virtual observatory)
astropy.io.votable
astropy.analytic_functions

Example: Coordinates



In [ ]:

    
from astropy import coordinates as coords
from astropy import units as u



In [ ]:

    
ra = 360. * np.random.rand(100)
dec = -90. + 180. * np.random.rand(100)
print("RA:", ra[:5], "...")
print("Dec:", dec[:5], "...")



In [ ]:

    
c = coords.SkyCoord(ra, dec, unit=(u.deg, u.deg))
c



In [ ]:

    
# convert to galactic
g = c.galactic
g



In [ ]:

    
# access longitude or latitude
g.l



In [ ]:

    
type(g.l)



In [ ]:

    
# get underlying numpy array
g.l.degree

Other astro packages

AstroPy-affiliated packages

PyPI packages with topic "Astronomy"



In [ ]: