In [17]:
%matplotlib inline
import random
import numpy as np
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import pydataset
Answer these in Markdown
Answer in Python
[4 points] Load the Nile dataset and convert to a numpy array. It contains measurements of the annual flow of the river Nile at Aswan. Make a scatter plot of the year vs flow rate. If you get an error when loading pydataset
that says No Module named 'pydataset'
, then execute this code in a new cell once: !pip install pydataset
[2 points] Report the correlation coefficient between year and flow rate.
[4 points] Create a histogram of the flow rates and show the median with a vertical line. Labels your axes and make a legend indicating what the vertical line is.
In [35]:
#2.1
nile = pydataset.data('Nile').as_matrix()
plt.plot(nile[:,0], nile[:,1], '-o')
plt.xlabel('Year')
plt.ylabel('Nile Flow Rate')
plt.show()
In [34]:
#2.2
print('{:.3}'.format(np.corrcoef(nile[:,0], nile[:,1])[0,1]))
In [44]:
#2.3 ok to distplot or plt.hist
sns.distplot(nile[:,1])
plt.axvline(np.mean(nile[:,1]), color='C2', label='Mean')
plt.legend()
plt.xlabel('Flow Rate')
plt.show()
Answer in Python
[2 points] Load the 'InsectSpray' dataset, convert to a numpy array and print the number of rows and columns. Recall that numpy arrays can only hold one type of data (e.g., string, float, int). What is the data type of the loaded dataset?
[2 points] Using np.unique
, print out the list of insect spray used. This data is a count insects on a crop field with various insect sprays.
[4 points] Create a violin plot of the data. Label your axes.
[2 points] Which insect spray worked best? What is the mean number of insects for the best insect spray?
In [47]:
#1.1
insect = pydataset.data('InsectSprays').as_matrix()
print(insect.shape, 'string or object is acceptable')
In [48]:
#1.2
print(np.unique(insect[:,1]))
In [50]:
#1.3
labels = np.unique(insect[:,1])
ldata = []
#slice out each set of rows that matches label
#and add to list
for l in labels:
ldata.append(insect[insect[:,1] == l, 0].astype(float))
sns.violinplot(data=ldata)
plt.xticks(range(len(labels)), labels)
plt.xlabel('Insecticide Type')
plt.ylabel('Insect Count')
plt.show()
In [54]:
#1.4
print('C is best and its mean is {:.2}'.format(np.mean(ldata[2])))
Load the 'airquality' dataset and convert into to a numpy array. Make a scatter plot of wind (column 2, mph) and ozone concentration (column 0, ppb). Using the plt.text
command, display the correlation coefficient in the plot. This data as nan
, which means "not a number". You can select non-nans by using x[~numpy.isnan(x)]
. You'll need to remove these to calculate correlation coefficient.
In [71]:
nyair = pydataset.data('airquality').as_matrix()
plt.plot(nyair[:,2], nyair[:,0], 'o')
plt.xlabel('Wind [mph]')
plt.ylabel('Ozone [ppb]')
nans = np.isnan(nyair[:,0])
r = np.corrcoef(nyair[~nans,2], nyair[~nans,0])[0,1]
plt.text(10, 130, 'Correlation Coefficient = {:.2}'.format(r))
plt.show()