Exploring data

Names of group members

// SOLUTIONS

Goals of this assignment

The purpose of this assignment is to explore data using visualization and statistics.

Section 1

The file datafile_1.csv contains a three-dimensional dataset and associated uncertainty in the data. Read the data file into numpy arrays and visualize it using two new types of plots:

  1. 2D plots of the various combinations of dimensions (x-y, x-z, y-z), including error bars (using the pyplot errorbar() method). Try plotting using symbols instead of lines, and make the error bars a different color than the points themselves.
  2. 3D plots of all three dimensions at the same time using the mplot3d toolkit - in particular, look at the scatter() method.

Hints:

  • Look at the documentation for numpy's loadtxt() method - in particular, what do the parameters skiprows, comments, and unpack do?
  • If you set up the 3D plot as described above, you can adjust the viewing angle with the command ax.view_init(elev=ANGLE1,azim=ANGLE2), where ANGLE1 and ANGLE2 are in degrees.
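
As a rough illustration of the loadtxt() parameters mentioned in the first hint, the sketch below reads a tiny made-up table from an in-memory string instead of a file. The table contents and the variable names (text, rows, x, y) are assumptions for demonstration only and say nothing about the layout of datafile_1.csv.

```python
import io
import numpy as np

# Hypothetical stand-in for a small CSV file: one header row,
# one comment line, then two columns of numbers.
text = """x,y
# this line is a comment
1.0,10.0
2.0,20.0
3.0,30.0
"""

# skiprows=1 drops the header row; comments='#' drops the comment line.
rows = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1, comments='#')

# unpack=True transposes the result, so each column comes back as its own array.
x, y = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1,
                  comments='#', unpack=True)

print(rows.shape)
print(x, y)
```

Without unpack=True the result is indexed as rows[row, column]; with it, the first index selects a column, which is usually more convenient for plotting.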

In [ ]:
# put your code here, and add additional cells as necessary.

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

alldata = np.loadtxt('datafile_1.csv',comments='#',unpack=True,delimiter=',')

xval = alldata[0]
xerr = alldata[1]
yval = alldata[2]
yerr = alldata[3]
zval = alldata[4]
zerr = alldata[5]

plt.errorbar(xval,yval,xerr=xerr,yerr=yerr,fmt='ro',ecolor='b')

In [ ]:
plt.errorbar(xval,zval,xerr=xerr,yerr=zerr,fmt='ro',ecolor='b')

In [ ]:
plt.errorbar(yval,zval,xerr=yerr,yerr=zerr,fmt='ro',ecolor='b')

In [ ]:
fig = plt.figure()
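# fig.gca(projection='3d') was removed in newer matplotlib releases;
# add_subplot creates the 3D axes in old and new versions alike.
ax = fig.add_subplot(111, projection='3d')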
ax.plot(xval, yval, zval,'bo')

ax.view_init(elev=30., azim=20)

Section 2

Now, we're going to experiment with data exploration. You have two data files to examine:

  1. GLB.Ts.csv, which contains mean global air temperature from 1880 through the present day (retrieved from the NASA GISS surface temperature website, "Global-mean monthly, seasonal, and annual means, 1880-present"). Each row in the data file contains the year, monthly global average, yearly global average, and seasonal global average. See this file for clues as to what the columns mean.
  2. bintanja2008.txt, which is a reconstruction of the global surface temperature, deep-sea temperature, ice volume, and relative sea level for the last 3 million years. This data comes from the National Oceanic and Atmospheric Administration's National Climatic Data Center website, and can be found here.

Some important notes:

  • These data files are slightly modified versions of those on the website - they have been altered to remove some characters that don't play nicely with numpy (letters with accents), and symbols for missing data have been replaced with 'NaN', or "Not a Number", which numpy can read and which its NaN-aware functions (such as nanmin) skip. No actual data has been changed.
  • In the file GLB.Ts.csv, the temperature units are in 0.01 degrees Celsius difference from the reference period 1951-1980 - in other words, the number 40 corresponds to a difference of +0.4 degrees C compared to the average temperature between 1951 and 1980. (This means you'll have to divide your values by 100 to get degrees C.)
  • In the file bintanja2008.txt, column 9, "Global sea level relative to present," is in confusing units - more positive values actually correspond to lower sea levels than less positive values. You may want to multiply column 9 by -1 in order to get more sensible values.
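
The two unit adjustments described in the notes above might look like the following sketch. The numbers are made up for illustration and are not taken from either data file; the array names are likewise hypothetical.

```python
import numpy as np

# Made-up values for illustration only; not taken from either data file.
anomaly_raw = np.array([-20.0, 40.0, np.nan, 85.0])  # units of 0.01 degrees C
sea_level_raw = np.array([120.0, 60.0, 0.0])         # larger value = lower sea level

anomaly_degC = anomaly_raw / 100.0   # e.g. 40 becomes +0.4 degrees C
sea_level = -1.0 * sea_level_raw     # now more negative means lower sea level

print(anomaly_degC)
print(sea_level)
```

Note that the NaN entry simply stays NaN after the division, so missing values survive the conversion and still need to be handled when computing statistics.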

There are many possible ways to examine this data. First, read both data files into numpy arrays - it's fine to load them into a single combined multi-dimensional array if you want, or split the data into multiple arrays. We'll then try a few things:

  1. For both datasets, make some plots of the raw data, particularly as a function of time. What do you see? How is the data "shaped"? Is there periodicity?
  2. Do some simple data analysis. What are the minimum, maximum, and mean values of the various quantities? (You may have problems with NaN values - see numpy's nanmin and similar functions.)
  3. If you calculate some sort of average for annual temperature in GLB.Ts.csv (say, the average temperature smoothed over 10 years), how might you characterize the yearly variability? Try plotting the smoothed value along with the raw data and show how they differ.
  4. There are several variables in the file bintanja2008.txt - try plotting multiple variables as a function of time together using the pyplot subplot functionality (and some more complicated subplot examples for further help). Do they seem to be related in some way? (Hint: plot surface temperature, deep sea temperature, ice volume, and sea level, and zoom in from 3 Myr to ~100,000 years)
  5. What about plotting the non-time quantities in bintanja2008.txt versus each other (i.e., surface temperature vs. ice volume or sea level) - do you see correlations?
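
To see why item 2 points at nanmin and its relatives, here is a minimal sketch using a made-up array with one missing value. The array is an assumption for illustration, not real temperature data.

```python
import numpy as np

# Made-up values for illustration; NaN marks a missing measurement.
values = np.array([0.1, np.nan, -0.3, 0.5])

plain_min = np.min(values)       # NaN: the ordinary reduction is poisoned by the missing value
safe_min = np.nanmin(values)     # smallest of the three valid values
safe_max = np.nanmax(values)     # largest of the three valid values
safe_mean = np.nanmean(values)   # mean of the three valid values

print(plain_min, safe_min, safe_max, safe_mean)
```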

In [ ]:
# put your code here, and add additional cells as necessary.

# column 0 is years, column 13 is yearly average
global_temp = np.loadtxt('GLB.Ts.csv',skiprows=1,unpack=True,delimiter=',')


# plot the monthly data so we can get some sense of range.
for i in range(1,13):
    plt.plot(global_temp[0],global_temp[i]/100.0,'r-')

# plot yearly data over this
plt.plot(global_temp[0],global_temp[13]/100.0,'b.')

# this is going to confuse people; encourage them to google it.
# http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.nanmin.html
print("min, max, mean: {:.3f} {:.3f} {:.3f}".format(np.nanmin(global_temp[13])/100.0,
                                                    np.nanmax(global_temp[13])/100.0,
                                                    np.nanmean(global_temp[13])/100.0))
smoothed = np.zeros_like(global_temp[13])

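# Running mean: average each year with its neighbors, from 10 years before
# to 9 years after (about a 20-year window), truncated at the array edges.
for i in range(global_temp[13].size):
    start = max(i - 10, 0)
    end = min(i + 10, global_temp[13].size)

    # nanmean skips missing (NaN) years instead of turning the whole average into NaN
    smoothed[i] = np.nanmean(global_temp[13][start:end])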

plt.plot(global_temp[0],smoothed/100.0,'g-',linewidth=3)

# calculate standard deviation around the smoothed value
print(np.nanstd(global_temp[13]-smoothed)/100.0)
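
As an alternative to the explicit loop, the smoothing step could be written without a Python loop. The sketch below is one possible approach rather than the method used above: rolling_nanmean is a hypothetical helper, it assumes an odd window width no longer than the array, and it is demonstrated on a made-up series.

```python
import numpy as np

def rolling_nanmean(values, width):
    """Centered moving average that skips NaN entries (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    kernel = np.ones(width)
    # Sum of valid values in each window (NaN replaced by 0 so it adds nothing).
    sums = np.convolve(np.where(valid, values, 0.0), kernel, mode='same')
    # Number of valid values in each window; windows shrink at the array edges.
    counts = np.convolve(valid.astype(float), kernel, mode='same')
    out = np.full(values.shape, np.nan)  # stays NaN where a window has no valid data
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

# Made-up series for illustration only.
series = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
smoothed_demo = rolling_nanmean(series, 3)
print(smoothed_demo)
```

Dividing by the per-window count of valid points, rather than by the window width, keeps the average unbiased both at the edges and around missing values.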

In [ ]:
global_temp_3myr = np.loadtxt('bintanja2008.txt',skiprows=110,unpack=True)

tbegin = 100
tend = 0

# Zoom every panel in on the window between tbegin and tend
# (same units as column 0 of the file).
plt.subplot(4, 1, 1)
plt.plot(global_temp_3myr[0],global_temp_3myr[4],'k-')
plt.title('Data over the last {} kyr'.format(tbegin))
plt.ylabel('Sfc Temperature [K]')
plt.xlabel('time (kyr)')
plt.xlim(tbegin,tend)

plt.subplot(4, 1, 2)
plt.plot(global_temp_3myr[0],global_temp_3myr[3],'r-')
plt.ylabel('deep sea temperature [K]')
plt.xlabel('time (kyr)')
plt.xlim(tbegin,tend)

plt.subplot(4, 1, 3)
plt.plot(global_temp_3myr[0],global_temp_3myr[7],'g-')
plt.ylabel('ice volume [m]')
plt.xlabel('time (kyr)')
plt.xlim(tbegin,tend)

plt.subplot(4, 1, 4)
plt.plot(global_temp_3myr[0],-global_temp_3myr[8],'b-')
plt.ylabel('sea level [m]')
plt.xlabel('time (kyr)')
plt.xlim(tbegin,tend)

In [ ]:
plt.plot(global_temp_3myr[7],-global_temp_3myr[8],'b.')
plt.xlabel('ice volume [m]')
plt.ylabel('sea level [m]')

In [ ]:
plt.plot(global_temp_3myr[4],-global_temp_3myr[8],'r.')
plt.xlabel('temperature [K]')
plt.ylabel('sea level [m]')
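
To put a number on the correlations seen in scatter plots like these, one option is the Pearson correlation coefficient. The sketch below uses a hypothetical helper, nan_corrcoef, and made-up values; it is not part of the original solution and does not use the real data.

```python
import numpy as np

def nan_corrcoef(a, b):
    """Pearson correlation of two 1D arrays, using only pairs where both are finite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    keep = np.isfinite(a) & np.isfinite(b)
    return np.corrcoef(a[keep], b[keep])[0, 1]

# Made-up values for illustration; not taken from bintanja2008.txt.
temperature = np.array([-4.0, -2.0, np.nan, 0.0, 1.0])
sea_level_demo = np.array([-80.0, -40.0, -30.0, 0.0, 20.0])

r = nan_corrcoef(temperature, sea_level_demo)
print(r)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; it says nothing about which quantity drives the other.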

In the cell below, describe some of the conclusions that you've drawn from the data you have just explored!

// put your thoughts here.

Wrapup

Do you have any lingering questions that remain after this project?

// put your answers here!

Turn it in!

Turn this assignment in to the Day 19 dropbox in the "in-class activities" folder.