Answer these in Markdown
a[5:2]? How can you tell without counting?
[4 points] Load the cars dataset and create a scatter plot. It contains measurements of cars' stopping distances in feet as a function of speed in mph. If you get an error when loading pydataset that says No module named 'pydataset', then execute this code in a new cell once: !pip install --user pydataset
[4 points] Compute the sample correlation coefficient between stopping distance and speed in Python and report your answer by writing a complete sentence in Markdown.
[2 points] Why might there be multiple stopping distances for a single speed?
In [22]:
#2.1
import pydataset
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
cars = pydataset.data('cars').values
plt.plot(cars[:,0], cars[:,1], 'o')
plt.xlabel('Speed [mph]')
plt.ylabel('Stopping Distance [ft]')
plt.show()
In [59]:
#2.2
np.corrcoef(cars[:,0].astype(float), cars[:,1].astype(float))
Out[59]:
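np.corrcoef returns a 2x2 matrix, and the sample correlation coefficient is its off-diagonal entry. The sketch below shows one way to pull out that single number and write it into a sentence. The speed and distance values are made up for illustration (they are not the cars data), and sample_correlation is a hypothetical helper that only cross-checks the definition.

```python
import numpy as np

# Made-up speed/distance pairs for illustration only (not the cars dataset).
speed = np.array([4.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
dist = np.array([2.0, 4.0, 16.0, 10.0, 18.0, 17.0, 24.0])

def sample_correlation(x, y):
    """Sample correlation coefficient computed directly from its definition."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_manual = sample_correlation(speed, dist)
# The [0, 1] entry of the matrix is the correlation between the two inputs.
r_numpy = float(np.corrcoef(speed, dist)[0, 1])

print(f"The sample correlation coefficient is {r_numpy:.2f}.")
```

The same indexing, np.corrcoef(...)[0, 1], should apply to the two columns of cars used above.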
Use pydataset.data('Housing', show_doc=True) to see information about the dataset. Use the snippet below to format your ticks with dollar signs and commas for thousands. Note that this data is from the 1970s. Assess the correlation between lotsize and price. Use plots and the sample correlation coefficient as evidence to support a written answer.
import matplotlib.ticker
fmt = '${x:,.0f}'
tick = matplotlib.ticker.StrMethodFormatter(fmt)
plt.gca().yaxis.set_major_formatter(tick)
[8 points] Use a violin plot to show if being in a preferred neighborhood affects price. You may use any other calculations (e.g., sample standard deviation) to support your conclusions. Write out your conclusion.
[8 points] Use a boxplot to determine if bedroom number affects price. What is your conclusion?
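The tick-formatting snippet in the first question relies on Python's str.format mini-language: StrMethodFormatter fills the field named x with each tick value. As a rough illustration, the same format string can be tried outside of a plot. The sample values here are arbitrary.

```python
# The same format string that is passed to StrMethodFormatter.
fmt = '${x:,.0f}'

# ',' inserts thousands separators and '.0f' rounds to zero decimal places.
labels = [fmt.format(x=value) for value in (25000, 1234567.8, 950)]

for label in labels:
    print(label)
```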
In [71]:
import matplotlib.ticker
fmt = '${x:,.0f}'
tick = matplotlib.ticker.StrMethodFormatter(fmt)
fmt = '{x:,.0f}'
xtick = matplotlib.ticker.StrMethodFormatter(fmt)
house = pydataset.data('Housing').values
plt.gca().yaxis.set_major_formatter(tick)
plt.gca().xaxis.set_major_formatter(xtick)
# column 0 is price and column 1 is lotsize, so lot size goes on x to match the axis labels
plt.plot(house[:,1], house[:,0], 'o')
plt.ylabel('House Price')
plt.xlabel('Lot Size [sq ft]')
# np.float was removed from recent NumPy releases; the builtin float is equivalent
np.corrcoef(house[:,0].astype(float), house[:,1].astype(float))
Out[71]:
There is a weak positive correlation. The sample correlation coefficient is low at 0.53, but there are enough data points that the trend is still visible, especially at small lot sizes.
In [60]:
import seaborn as sns
p = house[:,-1] == 'yes'
In [69]:
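# house is an object array, so convert the prices to float before plotting
sns.violinplot(data=[house[p,0].astype(float), house[~p,0].astype(float)])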
plt.gca().yaxis.set_major_formatter(tick)
print(np.median(house[p,0]))
print(np.median(house[~p,0]))
plt.xticks(range(2), ['Preferred Neighborhood', 'Normal Neighborhood'])
plt.ylabel('House Price')
plt.show()
The preferred neighborhood has a \$20,000 higher median price and has a longer tail at high prices, indicating many expensive homes.
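The question also allows supporting calculations such as the sample standard deviation. A minimal sketch of a per-group summary with a boolean mask follows. The prices and flags are made up for illustration and are not taken from the Housing data.

```python
import numpy as np

# Made-up prices and preferred-area flags, for illustration only.
price = np.array([42000.0, 38500.0, 49500.0, 60500.0, 61000.0, 66000.0])
pref = np.array(['no', 'no', 'no', 'yes', 'no', 'yes'])

mask = pref == 'yes'
for name, group in [('Preferred', price[mask]), ('Normal', price[~mask])]:
    # ddof=1 gives the sample (n - 1) standard deviation
    print(f"{name}: median = ${np.median(group):,.0f}, "
          f"sample std = ${np.std(group, ddof=1):,.0f}")

# Difference between the two group medians
median_gap = np.median(price[mask]) - np.median(price[~mask])
```

With the real data, price and pref would correspond to house[:,0].astype(float) and house[:,-1].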
In [51]:
labels = np.unique(house[:,2])
ldata = []
# slice out the prices of the rows that match each bedroom count
# and add them to the list
for l in labels:
    ldata.append(house[house[:,2] == l, 0].astype(float))
In [68]:
sns.boxplot(data=ldata)
plt.xticks(range(len(labels)), labels)
plt.xlabel('Number of Bedrooms')
plt.gca().yaxis.set_major_formatter(tick)
plt.ylabel('House Price')
plt.show()
The number of bedrooms is important up until 4, after which it seems to have less effect. Homes with 1 bedroom have a very narrow price distribution. Overall, price appears to be correlated with the number of bedrooms.
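The box in a boxplot spans the first to third quartile, with a line at the median, so the visual conclusion can be backed up with numbers. The sketch below computes those quartiles per bedroom count using the same np.unique grouping as above. The bedroom counts and prices are made up for illustration, and summary is a hypothetical name.

```python
import numpy as np

# Made-up (bedrooms, price) pairs for illustration only.
bedrooms = np.array([2, 3, 3, 2, 4, 3, 4, 2])
price = np.array([30000.0, 45000.0, 52000.0, 36000.0,
                  70000.0, 48000.0, 64000.0, 33000.0])

summary = {}
for label in np.unique(bedrooms):
    # select the prices of the rows that match this bedroom count
    group = price[bedrooms == label]
    q1, med, q3 = np.percentile(group, [25, 50, 75])
    summary[int(label)] = (float(q1), float(med), float(q3))
    print(f"{label} bedrooms: Q1 = ${q1:,.0f}, median = ${med:,.0f}, Q3 = ${q3:,.0f}")
```

Comparing the medians across labels gives a quick check on whether price keeps rising with each additional bedroom.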