Homework 1: Numpy, Scipy, Pandas

Due Friday Sept 9, 2016 @ 9am

#0: Get set up with your environment to work on and submit homework

a. Create a new homework repository at github

Name your repo something sensible (e.g. python-ay250-homeworks). Given your Berkeley affiliation you should be able to get private repos if you'd like.

b. Clone this repo locally and make a directory for this week's homework:

cd /class/directories ## this will be different on your machine
git clone https://github.com/profjsb/python-ay250-homework.git
cd python-ay250-homework
mkdir hw_1
echo "hw_1 README" >> hw_1/README.md
git add hw_1/README.md
git commit hw_1/README.md -m "added hw_1 directory"
git push

c. Copy this notebook into your hw_1 folder from a local version of the python-seminar repo

cd /class/directories
git clone https://github.com/profjsb/python-seminar.git 
cd python-seminar
git pull
cp Homeworks/hw_1/* /class/directories/python-ay250-homework/hw_1/

d. Get working! Be sure to check in your work as often as you'd like

cd /class/directories/python-ay250-homework
git add hw_1/<whatever>
git commit <whatever> -m "this is a check in"

e. To submit your work, send us (prof + GSIs) your GitHub handle and repo name so we can clone it (you'll need to add us to the repo if you've made it private)

#1: Super-resolution imaging

Obtaining several snapshots of the same scene, from microscopes to telescopes, is useful for increasing signal to noise in postprocessing: by summing up imaging data we can effectively beat down the noise. Interestingly, if we image the same scene from different vantage points we can also improve the clarity of the combined image. Being able to discern features in a scene from this combination effort is sometimes called super-resolution imaging.
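To see the stacking half of that claim in action, here is a minimal numpy sketch (synthetic data, not the homework frames): averaging N noisy snapshots of the same scene shrinks the per-pixel noise roughly as 1/sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.zeros((32, 32))                        # a flat "scene"
# 100 noisy snapshots of the same scene, unit noise per pixel
frames = truth + rng.normal(0.0, 1.0, size=(100, 32, 32))
stacked = frames.mean(axis=0)

print(frames[0].std())  # ~1.0 in a single frame
print(stacked.std())    # ~0.1 after stacking 100 frames
```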

Here, we'll combine about 4 seconds of a shaky video to reveal the statement on a license plate that is not discernible in any one frame.

A tarball of the data is at: https://drive.google.com/open?id=0B4vIeCR-xYNnbXFJTTVlVnpUZkk

tar -xvzf homework1_data.tgz  # do NOT check these files into git...

Problem 1 Read in each image into a numpy array. Resize each frame to be 3 times larger in each axis (i.e., 9 times more pixels). Using scipy.signal.fftconvolve, find the offsets of each frame with respect to the first frame. Report those offsets to 2 decimal places.

  • Hint1: you'll need to figure out how to resize a numpy array
  • Hint2: you'll want to reverse the second image when doing the convolution: scipy.signal.fftconvolve(im1, im2[::-1, ::-1])
  • Hint3: you'll need to figure out how to identify the peak of the fft convolution to find the offsets between images
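As a starting point for Hint1, one simple way to triple an array along each axis is nearest-neighbor repetition (scipy.ndimage.zoom, used in the solution below, gives an interpolated version of the same idea):

```python
import numpy as np

a = np.arange(4).reshape(2, 2)
# nearest-neighbor 3x upsample: each pixel becomes a 3x3 block
big = np.repeat(np.repeat(a, 3, axis=0), 3, axis=1)
print(big.shape)  # (6, 6)
```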

In [1]:
%pylab inline
import numpy as np
import scipy as sp
from scipy.ndimage import imread
from scipy.ndimage.interpolation import zoom
from scipy.signal import fftconvolve


Populating the interactive namespace from numpy and matplotlib

In [2]:
im = imread("Data/im2-0.png", flatten=True)
imshow(im, cmap=cm.Greys)


Out[2]:
<matplotlib.image.AxesImage at 0x10c3ae4e0>

In [3]:
##########################
### get list of images ###

import os
image_dir = "Data"
img_names = [x for x in sorted(os.listdir(image_dir)) if x.startswith("im2-")]
img_paths = [os.path.join(image_dir, x) for x in img_names]

In [4]:
ori_size = im.shape  # (288, 288)
resized = np.empty((len(img_paths), ori_size[0]*3, ori_size[1]*3))
thresholded = np.empty((len(img_paths), ori_size[0]*3, ori_size[1]*3))

for i, (img_name, img_path) in enumerate(zip(img_names, img_paths)):
    resized[i] = zoom(sp.ndimage.imread(img_path, flatten=True), 3)
    #-------------------- Thresholding ------------------#
    # Zero out the top and bottom borders, then zero the darker half of the
    # remaining pixels. Without thresholding, the correlation always peaks
    # at the center of the image.
    tmp = resized[i].copy()
    tmp[:150] = 0   # rows above the plate region (in resized coordinates)
    tmp[715:] = 0   # rows below the plate region
    tmp[tmp <= np.percentile(tmp[150:715].ravel(), 50)] = 0
    thresholded[i] = tmp

In [5]:
################################
##### offset of two images #####

def shift(im, im0):
    # correlate2d is too slow, so use fftconvolve instead
    corr = fftconvolve(im0, im[::-1, ::-1], mode='full')
    # Index of the highest correlation
    idx = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    # Offset relative to the array center. (The zero-offset peak of a 'full'
    # convolution sits at (shape - 1) / 2; using shape / 2 adds a uniform
    # half-pixel bias, which cancels out because every frame is registered
    # to the same reference frame.)
    origin = np.array(corr.shape) / 2
    offset = idx - origin
    return offset

In [6]:
corr = fftconvolve(thresholded[1], thresholded[0][::-1, ::-1], mode = 'full')
corr


Out[6]:
array([[  5.31243106e-07,   5.50353923e-07,   5.69476946e-07, ...,
          4.43361311e-07,   4.63209539e-07,   4.83204468e-07],
       [  6.28146206e-07,   6.45527962e-07,   6.63008840e-07, ...,
          5.47740771e-07,   5.65949585e-07,   5.84356711e-07],
       [  3.54874762e-07,   3.70728755e-07,   3.86592675e-07, ...,
          2.80506982e-07,   2.97414225e-07,   3.14452887e-07],
       ..., 
       [  3.64149452e-07,   3.65059110e-07,   3.65900263e-07, ...,
          3.55827340e-07,   3.58355459e-07,   3.60934843e-07],
       [  3.57254639e-07,   3.59222739e-07,   3.61119358e-07, ...,
          3.43506267e-07,   3.47338745e-07,   3.51146513e-07],
       [  1.71873329e-07,   1.72317477e-07,   1.72665857e-07, ...,
          1.65242510e-07,   1.67513260e-07,   1.69703792e-07]])

In [7]:
offset = np.empty((len(img_paths), 2))
for i, img in enumerate(thresholded):
    offset[i] = shift(img, thresholded[0])

In [8]:
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
print(offset)


[[-0.50 -0.50]
 [7.50 -39.50]
 [103.50 -8.50]
 [56.50 -166.50]
 [62.50 -23.50]
 [24.50 -104.50]
 [55.50 -8.50]
 [176.50 -19.50]
 [113.50 32.50]
 [55.50 -73.50]
 [106.50 -96.50]
 [40.50 -23.50]
 [121.50 -7.50]
 [-48.50 -81.50]
 [78.50 -109.50]
 [47.50 -97.50]
 [15.50 -107.50]
 [112.50 -63.50]
 [120.50 -62.50]
 [-0.50 -167.50]
 [45.50 48.50]
 [51.50 56.50]
 [137.50 -0.50]
 [177.50 32.50]
 [80.50 -127.50]
 [113.50 87.50]
 [81.50 16.50]
 [168.50 -57.50]
 [-1.50 -112.50]
 [104.50 -0.50]
 [157.50 63.50]
 [98.50 49.50]
 [55.50 -44.50]
 [44.50 135.50]
 [-0.50 1.50]
 [80.50 -8.50]
 [112.50 -2.50]
 [91.50 12.50]
 [79.50 -24.50]
 [61.50 9.50]
 [7.50 -148.50]
 [49.50 -90.50]
 [80.50 -99.50]
 [136.50 28.50]
 [137.50 -5.50]
 [6.50 32.50]
 [129.50 7.50]
 [56.50 -58.50]
 [66.50 -142.50]
 [-0.50 -31.50]
 [87.50 -137.50]
 [64.50 -36.50]
 [-13.50 20.50]
 [-0.50 -64.50]
 [22.50 -24.50]
 [112.50 -104.50]
 [133.50 71.50]
 [23.50 -71.50]
 [194.50 -21.50]
 [-49.50 -44.50]
 [28.50 -88.50]
 [47.50 -122.50]
 [15.50 -7.50]
 [-8.50 -145.50]
 [105.50 35.50]
 [105.50 -57.50]
 [63.50 -56.50]
 [-0.50 -121.50]
 [55.50 -1.50]
 [15.50 -3.50]
 [78.50 -10.50]
 [28.50 -109.50]
 [-8.50 -57.50]
 [68.50 7.50]
 [-24.50 -11.50]
 [72.50 -65.50]
 [57.50 -72.50]
 [126.50 -40.50]
 [64.50 -35.50]
 [56.50 -95.50]
 [94.50 -30.50]
 [143.50 38.50]
 [53.50 -74.50]
 [-56.50 -0.50]
 [107.50 8.50]
 [215.50 -43.50]
 [187.50 -15.50]
 [65.50 89.50]
 [14.50 -55.50]
 [-0.50 -80.50]
 [185.50 90.50]
 [88.50 10.50]
 [112.50 55.50]
 [79.50 -154.50]
 [119.50 47.50]
 [0.50 -56.50]
 [104.50 -17.50]
 [15.50 16.50]
 [144.50 45.50]
 [92.50 -5.50]]

Problem 2 Shift each image to register the frames to the original (expanded in size) frame. You should, in general, be shifting by subpixel offsets. You might want to look at scipy.ndimage.interpolation.shift


In [9]:
#####################################
## shift back in the resized frame ##
shifted = np.empty_like(resized)
for i, (img, off) in enumerate(zip(resized, offset)):  # 'off', to avoid shadowing shift()
    shifted[i] = sp.ndimage.interpolation.shift(img, off)

Problem 3 Combine all the registered images to form a super-resolution image. What does the license plate read?


In [10]:
im_com = np.mean(shifted, axis=0)
imshow(im_com, cmap=cm.Greys)


Out[10]:
<matplotlib.image.AxesImage at 0x10f4ad320>

#2: An elementary introduction to spectral audio compression

In this problem, we'll explore the very basics of audio compression in the spectral domain using numpy and scipy. We'll do a bit of visualization with matplotlib, but since that is covered later in the course, we'll provide those functions for you.

Audio compression is a large and complex topic, and the design of a format for compressed audio such as the popular MP3 is too complex to cover in detail here. However, we will introduce the basic tools that most such compression formats use, namely:

  1. Converting the input signal to the frequency domain by taking a Fast Fourier Transform (FFT).

  2. Dropping information in the frequency domain, resulting in a smaller amount of data.

  3. Reconstructing back the signal in the time domain from this smaller representation of the signal.

Steps 1 and 2 above are the 'encoding' part of signal compression, and step 3 is the 'decoding' part. For this reason, the tools that perform these steps are typically referred to as signal 'codecs', short for encoders/decoders.
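The three steps can be sketched in a few lines of numpy (a toy signal and an arbitrary 10%-of-peak threshold, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.05 * rng.normal(size=256)

X = np.fft.rfft(x)                          # 1. encode: to the frequency domain
keep = np.abs(X) >= 0.1 * np.abs(X).max()   # 2. drop small components
X_small = X * keep
x_rec = np.fft.irfft(X_small, n=len(x))     # 3. decode: back to the time domain

print(int(keep.sum()), "of", len(X), "components kept")
```

The kept mask is the "smaller amount of data": only the surviving coefficients (and their positions) would need to be stored.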

Note that here we say 'signal': while MP3 is an audio format, the same ideas apply to the compression of digital images with formats such as JPEG, and to video. Virtually all multimedia technologies we use today, from audio players to cell phones, digital cameras and YouTube, are based on sophisticated extensions and applications of these simple ideas.

Let's first load the plotting tools and import some utilities we'll need later:


In [11]:
%pylab inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

# we'll need some path manipulations later on
import os


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['imread']
`%matplotlib` prevents importing * from pylab and numpy

We define a simple utility function to listen to audio files right in the browser:


In [12]:
def Audio(fname):
    """Provide a player widget for an audio file.
    
    Parameters
    ==========
    fname : string
      Filename to be played.
      
    Warning
    =======
    
    Browsers cache audio very aggressively. If you change an
    audio file on disk and are trying to listen to the new version,
    you may want to clear your browser's cache or rename the file.
    """
    from IPython.display import HTML, display
    
    # Find out file extension and deduce MIME type for audio format
    ext = os.path.splitext(fname)[1].replace('.', '').lower()
    mimetype = 'audio/' + ('mpeg' if ext == 'mp3' else ext)
    
    tpl = """<p>{fname}:</p>
<audio controls>
    <source src="files/{fname}" type="{mimetype}">

Your browser does not support the Audio element; you can play 
<a href="files/{fname}">this file</a> manually.
</audio>
"""
    display(HTML(tpl.format(**locals())))

We also define a convenience wrapper around plt.specgram, matplotlib's spectrogram function, with a colorbar and control over the color limits displayed. This will make it easier to compare across different signals with the same colors for all inputs.


In [13]:
def specgram_cbar(x, title=None, clim=(0, 80) ):
    """Plot spectrogram with a colorbar and range normalization.
    
    Call matplotlib's specgram function, with a custom figure size,
    automatic colorbar, title, and custom color limits to ease
    comparison across multiple figures.
    
    Parameters
    ==========
    x : array
      One-dimensional array whose spectrogram should be plotted.
      
    title : string
      Optional title for the figure.
      
    clim : 2-tuple
      Range for the color limits plotted in the spectrogram.
    """
    f = plt.figure(figsize=(10,3))
    plt.specgram(x)
    plt.colorbar()
    plt.clim(*clim)
    if title is not None:
        plt.title(title)
    plt.show()

Problem 1: Use the Audio function above to listen to the signal we will be experimenting with, a simple voice recording stored in the file Data/voice.wav.

Note: if your browser doesn't support audio, you may try a different browser. We've tested current versions of Chrome and Firefox, and it works OK with both.


In [14]:
Audio("Data/voice.wav")


Data/voice.wav:

Problem 2: Write a function to compress a 1-d signal by dropping a fraction of its spectrum.

You can drop the smallest components by setting their values to zero.

Hints:

  • look at the np.fft module, keeping in mind that your input signal is real.
  • look at the argsort method of numpy arrays.
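For instance, the argsort hint can be used to find and zero the k smallest-magnitude components (shown here on a small real-valued array; for FFT coefficients you would rank np.abs of the complex values):

```python
import numpy as np

X = np.array([3.0, -0.2, 5.0, 0.1, -4.0])
k = 2                             # number of components to drop
drop = np.argsort(np.abs(X))[:k]  # indices of the k smallest magnitudes
X[drop] = 0.0
print(X)
```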

In [15]:
def compress_signal(x, fraction):
    """Compress an input signal by dropping a fraction of its spectrum.
    
    Parameters
    ==========
    x : array
      1-d real array to be compressed
      
    fraction : float
      A number in the [0,1] range indicating which fraction of the spectrum
      of x should be zeroed out (1 means zero out the entire signal).
      
    Returns
    =======
    x_approx : array
      1-d real array reconstructed after having compressed the input.
    """
    x_fft = np.fft.rfft(x)
    # Rank components by magnitude: the coefficients are complex, so the
    # percentile threshold and comparison must use np.abs, not x_fft itself.
    mags = np.abs(x_fft)
    lb = np.percentile(mags, 100 * fraction)
    x_comprs = x_fft * (mags >= lb)
    # n=len(x) ensures odd-length inputs reconstruct to the original length
    x_ifft = np.fft.irfft(x_comprs, n=len(x))
    return x_ifft

As a quick visual check (note that this is not a formal test of correctness), experiment with a simple random signal by changing the compression ratio and plotting both the signal and the compressed version:


In [16]:
x = np.random.rand(128)

In [17]:
fraction = 0.6  # try changing this in the 0-1 range

xa = compress_signal(x, fraction)

plt.figure(figsize=(12,3))
plt.plot(x, alpha=0.5, lw=2, label='original')
plt.plot(xa, lw=2, label='compressed {0:.0%}'.format(fraction))
plt.legend();


Problem 3: Write a function that will compress an audio file by dropping a fraction of its spectrum, writing the output to a new file.

If the input file is named a.wav and the compression fraction is 0.9, the output file should be named a_comp_0.9.wav.

Hints:

  • look at the scipy.io module for routines dealing with files in wav format.

  • you may need to use the astype method of numpy arrays to get the correct data type for wav files.


In [31]:
import scipy.io.wavfile

def compress_wav(fname, fraction):
    """Compress an audio signal stored in an input wav file.
    
    The compressed signal is returned as a numpy array and automatically written 
    to disk to a new wav file.
    
    Parameters
    ==========
    fname : string
      Name of the input wav file
      
    fraction : float
      Fraction of input data to keep.
      
    Returns
    =======
    rate : int
      Sample rate of the input signal.

    x : array
      Raw data of the original input signal.
      
    x_approx : array
      Raw data of the compressed signal.
      
    new_fname : string
      Auto-generated filename of the compressed signal.
    """
    
    rate, x = scipy.io.wavfile.read(fname)
    # e.g. "a.wav" with fraction 0.9 -> "a_comp_0.9.wav"
    new_fname = fname[:-4] + "_comp_" + str(fraction) + fname[-4:]

    x_approx = compress_signal(x, fraction)
    # wav files expect integer samples, hence the astype
    sp.io.wavfile.write(new_fname, rate, x_approx.astype("int16"))
    return rate, x, x_approx, new_fname

Problem 4: Study the effect of compressing the input file at different ratios: 0.1, 0.5, 0.75, 0.9, 0.95, 0.99.

Using the OrderedDict class from the Python collections module, store the uncompressed signal as well as the compressed array and filename for each compression ratio.

You will create an OrderedDict called voices, with:

  • keys: compression ratios
  • values: pairs of (x, filename) where x is the compressed audio and filename is the name of the compressed file.

In [32]:
from collections import OrderedDict
ratios = [0, 0.1, 0.5, 0.75, 0.9, 0.95, 0.99]
all_compress = OrderedDict()
for ratio in ratios:
    all_compress[ratio] = compress_wav("Data/voice.wav", ratio)

Problem 5: Loop over the voices dict, and for each one generate an audio player as well as a spectrogram. Observe how the spectrogram changes, and listen to each file. At what ratio do you stop understanding the recording?


In [33]:
for compress in all_compress:
    Audio(all_compress[compress][3])
    specgram_cbar(all_compress[compress][2])


Data/voice_comp_0.wav:

Data/voice_comp_0.1.wav:

Data/voice_comp_0.5.wav:

Data/voice_comp_0.75.wav:

Data/voice_comp_0.9.wav:

Data/voice_comp_0.95.wav:

Data/voice_comp_0.99.wav:


In [21]:
plt.figure(figsize=(12,3))
for ratio in all_compress:
    signal = all_compress[ratio][2]
    plt.plot(signal, lw=2, label='compressed {0:.0%}'.format(ratio))
plt.legend();


#3: Armchair Astronomer

Oftentimes, people act as good sensors of the physical universe. We can use Google Trends data to help us determine some fundamental parameters of the Solar System.

Problem 1: Use just the CSV file we created in the pandas lecture (merged_data.csv) and some frequency-analysis tools in scipy to determine:

  • the number of days in a year
  • the period of the moon's orbit around the Earth

Hint: from scipy.signal import lombscargle
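To see why a periodogram recovers the period, here is a numpy-only sketch on synthetic weekly samples (a classical periodogram rather than Lomb-Scargle, which suffices here because the samples are evenly spaced; the data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(52 * 5) * 7.0   # 5 years of weekly samples, in days
y = np.sin(2 * np.pi * t / 365.25) + 0.2 * rng.normal(size=t.size)

periods = np.linspace(100, 500, 5000)   # candidate periods in days
omega = 2 * np.pi / periods
# power of the signal projected onto each trial frequency
power = np.abs(np.exp(-1j * np.outer(omega, t)) @ y) ** 2
best = periods[np.argmax(power)]
print(best)  # close to 365
```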


In [22]:
import numpy as np
import pandas as pd  
from scipy.signal import lombscargle
pd.options.display.width = 1000

In [23]:
data = pd.read_csv('merged_data.csv')
moon = np.array(data["full moon"])
icecream = np.array(data["ice cream"])
volleyball = np.array(data["Volleyball"])
tennis = np.array(data["tennis"])
data.head()


Out[23]:
         Week  ice cream  Volleyball  tennis  full moon
0  2011-09-03         25          18      83         26
1  2011-09-10         21          18      58         50
2  2011-09-17         20          18      25         14
3  2011-09-24         21          18      24         17
4  2011-10-01         20          18      24         21

In [24]:
# calculate year by periodicity of icecream/volleyball/tennis

day_to_seconds = 24 * 3600.0
timesteps = np.arange(icecream.shape[0]) * day_to_seconds * 7
periods = np.linspace(100 * day_to_seconds, 500 * day_to_seconds, 10000) 
ang_freq = 1.0/periods * 2 * pi

periodogram_ic = lombscargle(timesteps.astype('float64'), icecream.astype('float64'), ang_freq)
print("year by icecream is:", 1/(ang_freq[np.argmax(periodogram_ic)]/2/pi)/day_to_seconds)

periodogram_vb = lombscargle(timesteps.astype('float64'), volleyball.astype('float64'), ang_freq)
print("year by volleyball is:", 1/(ang_freq[np.argmax(periodogram_vb)]/2/pi)/day_to_seconds)

periodogram_tn = lombscargle(timesteps.astype('float64'), tennis.astype('float64'), ang_freq)
print("year by tennis is:", 1/(ang_freq[np.argmax(periodogram_tn)]/2/pi)/day_to_seconds)

plt.plot(periods/day_to_seconds, periodogram_ic, label = "ice cream")
plt.plot(periods/day_to_seconds, periodogram_vb, label = "volleyball")
plt.plot(periods/day_to_seconds, periodogram_tn, label = "tennis")
plt.title("days in a year")
plt.legend(loc = 2)


year by icecream is: 359.945994599
year by volleyball is: 351.465146515
year by tennis is: 352.345234523
Out[24]:
<matplotlib.legend.Legend at 0x15b5e9fd0>

In [25]:
# calculate the moon's orbital period; restrict the search to [10, 50] days

periods = np.linspace(10 * day_to_seconds, 50 * day_to_seconds, 10000) 
ang_freq = 1.0/periods * 2 * pi

periodogram_moon = lombscargle(timesteps.astype('float64'), moon.astype('float64'), ang_freq)
print("month by moon is:", 1/(ang_freq[np.argmax(periodogram_moon)]/2/pi)/day_to_seconds)

plt.plot(periods/day_to_seconds, periodogram_moon, label = "month by moon")
plt.title("period of the moon's orbit around the Earth")


month by moon is: 29.5299529953
Out[25]:
<matplotlib.text.Text at 0x15b8def60>

#4: Reproducing some insights about the Election

Nate ("not a genius, just a Bayesian") Silver writes often about polls and their utility in predicting elections. One of the things he emphasized during the 2016 campaign is that even "large" polls of people with a consistent lead for one candidate will show wild swings in any given window in time.
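That claim is easy to check with a quick simulation (a minimal sketch; the 200-respondents-per-day and 5-day-window figures are assumptions matching the plot reproduced below):

```python
import numpy as np

rng = np.random.default_rng(1)
n_daily, p, days = 200, 0.53, 180   # hypothetical: 200 respondents/day, true 53-47 split
daily_votes = rng.binomial(n_daily, p, size=days)
cum = np.concatenate([[0], np.cumsum(daily_votes)])
# 5-day rolling two-party lead: (vote share) * 2 - 1
lead = (cum[5:] - cum[:-5]) / (5.0 * n_daily) * 2 - 1
print(lead.min(), lead.max())  # swings of several points around the true 6% lead
```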

Problem 1: Using Pandas and numpy, try to qualitatively reproduce this plot from a Nate Silver tweet, using the same assumptions.

https://twitter.com/NateSilver538/status/769565612955824128


In [26]:
# randomly generate binomial(n=200, p=0.53) daily polls for Clinton
import numpy as np
import pandas as pd
import matplotlib.dates as mdates

times = pd.date_range('05/20/2016', '11/15/2016')
poll = np.random.binomial(200, 0.53, size=(times.shape[0] + 5,))
cumpoll = poll.cumsum()
# 5-day rolling lead: (vote share) * 2 - 1, with 5 * 200 = 1000 votes per window
rolling_votes = cumpoll[5:] - cumpoll[:-5]
rolling_prob = rolling_votes/1000 * 2 - 1

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_title("Clinton's 'lead' in hypothetical tracking poll",
              size=14, family='Calibri', fontweight="bold")
ax1.plot(times, rolling_prob, color='purple')
ax1.plot(times, np.zeros(len(times)), color='black')
# set the tick positions along with the labels so they stay in sync
y_axis = np.arange(9)*3/100 - 0.03
ax1.set_yticks(y_axis)
ax1.set_yticklabels(['{:2.0f}%'.format(x*100) for x in y_axis])
ax1.grid()


/Users/shiman/anaconda/envs/bootcamp-env/lib/python3.5/site-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family ['Calibri'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Problem 2: Clearly, even with a 6-point lead, there's a chance that this sort of poll would show the other person in the lead. How far ahead (in percentage points) would a candidate need to be for a tracking poll to never show the other candidate ahead over the course of a year (in your simulation)?


In [27]:
def always_lead(pr, num=200, times=pd.date_range('05/20/2016', '11/15/2016')):
    # returns 1 if the 5-day rolling lead never dips below zero, else 0
    poll = np.random.binomial(num, pr, size=(times.shape[0] + 5,))
    cumpoll = poll.cumsum()
    rolling_votes = cumpoll[5:] - cumpoll[:-5]
    rolling_prob = rolling_votes/num/5 * 2 - 1
    return int(min(rolling_prob) >= 0)

In [40]:
### simulate n = 1000 times, sequentially increasing the leading probability
### accept the first probability for which alpha = 95% of simulations always lead

alpha = 0.95
n = 1000
lead = np.arange(200)/400+0.5

for prob in lead:
    sim = np.ones(n) * prob
    for i in range(n):
        sim[i] =  always_lead(prob)
    if sum(sim) >= alpha * n:
        print("Min points to guarantee always leading is:", (prob*2-1)*100, "%")
        break


Min points to guarantee always leading is: 10.5 %

Problem 3: With a 3 and 6% lead, how many people would need to be polled in 1 day to have the rolling 5-day poll result always show the leader ahead (over a year)?


In [38]:
alpha = 0.95
n = 1000
Ppl = np.arange(100, 20000, 20)
for ppl in Ppl:
    sim = np.zeros(n)
    for i in range(n):
        sim[i] = always_lead(pr = 0.515, num = ppl)
    if sum(sim) >= alpha * n:
        print("leading = 3%, min number of people to guarantee always leading is:", ppl)
        break


leading = 3%, min number of people to guarantee always leading is: 2460

In [39]:
for ppl in Ppl:
    sim = np.zeros(n)
    for i in range(n):
        sim[i] = always_lead(pr = 0.53, num = ppl)
    if sum(sim) >= alpha * n:
        print("leading = 6%, min number of people to guarantee always leading is:", ppl)
        break


leading = 6%, min number of people to guarantee always leading is: 620

I discussed this HW with Kevin Li, Turge Gurek, Ying Cao, and Jing Dai.

