```
In [1]:
```import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

- Create a numpy array that has 10 uniform random numbers between 25.0 and 35.0, store it as a variable
- Createa pandas dataframe with one frame based on your numpy array
- Use "np.std" and "pd.std" formulas to calculate the standard deviation from both arrays, do not change any of the default arguments
- The numbers should be different, explain why
- Repeat the exercise with 1E6 uniform random numbers drawn from the same range
- Comment on whether the numbers are different or not and explain why
- Demonstrate that by using the proper function arguments you can obtain the same answer in both methods

- The data file GerberdingElectricityChilledWater.csv shows chilled water (energy) and electricity usage for Gerberding Hall over approximately an 18 month period from Jan 2013 to June 2014.
**Note: these are real data! If you make any executive decisions (e.g., remove some points for very specific reasons), clearly explain your decision and motivation for doing so**Load the data into Python (your choice of method) and prepare a box plot summary of the data.

- Present the plot in the nicest possible format (e.g., improve it from the default and prepare it for publication or presentation format) - this is open ended and you can use your judgement
- Explain in your own words what each part on the box plot means

```
In [2]:
```unirand = np.random.uniform(low = 25.0, high = 35.0, size = 10)
print(unirand)
unidataframe = pd.DataFrame(unirand)
print(unidataframe)
print('\nnp.std = %f; pd.std = %f' %(np.std(unidataframe), unidataframe.std()))
print('By default, degree of freedoms of np.std and pd.std are 0 and 1 respectively.')
unirand = np.random.uniform(low = 25.0, high = 35.0, size = 1000000)
unidataframe = pd.DataFrame(unirand)
print('1E6 random numbers:\nnp.std = %f; pd.std = %f' %(np.std(unidataframe), unidataframe.std()))
print('The valuses are very close, because when n is large enough(infinity), the distribution of the sample(dof=1) will be same as population(dof=0).')
print('We can get same results by using ddof.\ndof = 0:\nnp.std = %f; pd.std(ddof=0) = %f\ndof = 1:\nnp.std(ddof=1) = %f; pd.std = %f' %(np.std(unidataframe), unidataframe.std(ddof = 0),np.std(unidataframe, ddof = 1), unidataframe.std()))

```
```

```
In [3]:
```data = pd.read_csv('GerberdingElectricityChilledWater.csv')
print(data.head())

```
```

```
In [4]:
```plt.figure(figsize=(10,5))
plt.subplot(121)
data.boxplot(['Btu'])
plt.xticks([1], ['Chilled water usuage'])
plt.ylabel('Btu')
plt.subplot(122)
data.boxplot(['kWh'])
plt.xticks([1],['Electricity usuage'])
plt.ylabel('kWh')
plt.suptitle('Chilled Water and Electricity Usage for Gerberding Hall (with outliers)')
print(data.describe())

```
```

```
In [5]:
```IQR = data.quantile(q = 0.75) - data.quantile(q = 0.25)
upermild = data.quantile(q = 0.75) + 1.5*IQR
uperextre = data.quantile(q = 0.75) + 3*IQR
lowermild = data.quantile(q = 0.25) - 1.5*IQR
lowerextre = data.quantile(q = 0.25) - 3*IQR
print(lowerextre)
print(lowermild)
print(upermild)
print(uperextre)

```
```

```
In [6]:
```for index, row in data.iterrows():
if row['Btu'] > uperextre[0]:
data.drop(index,inplace = True)
elif row['kWh'] > uperextre[1]:
data.drop(index,inplace = True)
elif row['kWh'] < 0:
data.drop(index,inplace = True)

```
In [7]:
```#plot again
plt.figure(figsize=(10,5))
plt.subplot(121)
data.boxplot(['Btu'])
plt.xticks([1], ['Chilled water usuage'])
plt.ylabel('Btu')
plt.subplot(122)
data.boxplot(['kWh'])
plt.xticks([1],['Electricity usuage'])
plt.ylabel('kWh')
plt.suptitle('Chilled Water and Electricity Usage for Gerberding Hall (without extreme outliers)')

```
Out[7]:
```

Also, I can plot without showing any outliers, shown below.

```
In [8]:
```plt.figure(figsize=(10,5))
plt.subplot(121)
plt.boxplot(data.Btu,showfliers = False)
plt.xticks([1], ['Chilled water usuage'])
plt.ylabel('Btu')
plt.grid()
plt.subplot(122)
plt.boxplot(data.kWh,showfliers = False)
plt.xticks([1],['Electricity usuage'])
plt.ylabel('kWh')
plt.ylim([-2,42])
plt.grid()
plt.suptitle('Chilled Water and Electricity Usage for Gerberding Hall (without any outliers)')

```
Out[8]:
```

- Your goal is to prepare a side-by-side plot describing the distribution of data from part 1 related to Gerberding Hall.

1) To complete the plot you should assume the data are normally distributed and determine the mean and standard deviation of the chilled water data series.

2) The left panel of the plot should be a normalized histogram of the chilled water data with an **overlay** of the PDF of the normal distribution based on the $\bar x$ and $s$ values from the data series. The right panel of the plot should be contain two lines corresponding to the measured CDF and CDF from the normal distribution estimated from the data.

3) Repeat (2) for the electricity water

4) Comment in a Markdown cell on the ability of a normal distribution to describe this data

```
In [9]:
```from scipy.stats import norm
#left
x=np.arange(lowerextre[0],uperextre[0])
y = norm.pdf(x,data['Btu'].mean(),data['Btu'].std())
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.plot(x, y, label = 'PDF of the normal distribution', linewidth = 1.5,color = 'k')
plt.hist(data['Btu'], bins = 20, normed = True, label = 'normalized histogram', color = '#668cff',edgecolor='k', lw=1.5)
plt.ylim([0,0.000027])
plt.grid()
plt.legend()
plt.xlabel('Chilled water usuage (Btu)')
plt.ylabel('Frequency')
#right
y = norm.cdf(x,data['Btu'].mean(),data['Btu'].std())
counts, bin_edges = np.histogram(data['Btu'], bins= 20, density=True)
cdf = np.cumsum(counts * np.diff(bin_edges))
plt.subplot(122)
plt.plot(x, y, label = 'CDF from the normal distribution',color = 'k')
plt.plot(bin_edges[1:], cdf, label = 'measured CDF',color = '#668cff')
plt.grid()
plt.legend(loc='upper right')
plt.xlabel('Chilled water usuage (Btu)')
plt.ylim([0,1.2])
plt.ylabel('Cumulative frequency')
plt.suptitle('Chilled Water Usage for Gerberding Hall')

```
Out[9]:
```

```
In [10]:
```#left
x=np.arange(lowerextre[1],uperextre[1])
y = norm.pdf(x,data['kWh'].mean(),data['kWh'].std())
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.plot(x, y, label = 'PDF of the normal distribution', linewidth = 1.5,color = 'k')
plt.hist(data['kWh'], bins = 20, normed = True, label = 'normalized histogram', color = '#668cff',edgecolor='k', lw=1.5)
plt.ylim([0,0.14])
plt.grid()
plt.legend()
plt.xlabel('Electricity usuage (kWh)')
plt.ylabel('Frequency')
#right
y = norm.cdf(x,data['kWh'].mean(),data['kWh'].std())
counts, bin_edges = np.histogram(data['kWh'], bins= 20, density=True)
cdf = np.cumsum(counts * np.diff(bin_edges))
plt.subplot(122)
plt.plot(x, y, label = 'CDF from the normal distribution',color = 'k')
plt.plot(bin_edges[1:], cdf, label = 'measured CDF',color = '#668cff')
plt.grid()
plt.legend(loc='upper right')
plt.xlabel('Electricity usuage (kWh)')
plt.ylim([0,1.2])
plt.ylabel('Cumulative frequency')
plt.suptitle('Electricity Usage for Gerberding Hall')

```
Out[10]:
```

- Chilled water usuage is closer to normal distribution than electricity usuage is. And we can see that the CDF from the normal distribution is a good approxmation to the measured CDF, so I think it has a good ability to describe chilled water usuage. In contrast, in electricity usuage, both plots show that PDF and CDF from normal distribution don't fit the measured data very well.

The file energyuse.csv contains energy use data for 6 UW students. The data include electricity for lighting, all other electricity use and total electricity use. The final entry in the data file is the national average of the same values.

Please do the following

1) Formulate a statistical hypothesis to test about the three data sets and clearly state it

2) Perform a test a significance level of P=0.05 (*make sure to clearly comment your work so I can follow what you are doing*)

3) Clearly state the meaning of the results in plain language

```
In [11]:
```eusedata = pd.read_csv('energyuse.csv')
eusedata

```
Out[11]:
```

```
In [12]:
```from scipy import stats
#Calculate statistic value.
mean = eusedata['Total'][0:6].mean()
std = eusedata['Total'][0:6].std(ddof = 1)
N = np.shape(eusedata)[0]-1
sem = stats.sem(eusedata['Total'][0:6],ddof=1)
t = (mean - 3.040)/ sem
p = stats.t.sf(np.abs(t),N-1)*2
print('mean = %.4f, std = %.4f, sem = %.4f, t = %.4f, p = %.4f' %(mean, std, sem, t, p))

```
```

```
In [13]:
```#Confirm t and p
[tcalc,p]=stats.ttest_1samp(eusedata['Total'][0:6],3.040)
tcalc,p

```
Out[13]:
```

- In statistics, the p-value represents the probablity of extreme value by assuming H0 is true. When p-value is smaller enough(under desired significance level, $\alpha$), extreme value has very low probablity, but it still exists. Thus, we have confidence to reject the assumption H0 is true. That is, we can state that Ha is true. In contrast, when p-value is larger than $\alpha$, we can only state that we don't have enough evidence to reject H0.
- In this case, p-value is 0.0192, which is smaller than $\alpha$ = 0.05, so we reject H0 and state that Ha is true. That is, the average total usuage for 6 UW students is not equal to the national Avg = 3.040 in this case.
- I try to plot the p-value concept(based on my understanding) below. We can see that the average total usuage for 6 UW students falls into the rejection area, so we reject H0.

```
In [14]:
```x=np.arange(2.0,4.0,0.01)
#Normal distribution based on mu and sem
y = norm.pdf(x,3.040,sem)
plt.figure(figsize=(10,10))
plt.subplot(111)
plt.plot(x, y, label = 'Normal distribution by sem', linewidth = 1.5,color = 'k')
plt.xlabel('Total energy usuage')
plt.ylabel('Frequency')
plt.ylim([-0.15,2.25])
plt.xlim([1.8,4.2])
#tag mu
x = np.full(2,3.040)
plt.plot(x,[-0.15,2.0],'--',label = '$\mu$ = 3.040')
#tag mean
x = np.full(2,mean)
plt.plot(x,[-0.15, p],'--',label = 'mean = 2.363, p=0.199', color = 'g')
y = np.full(2,p)
plt.plot([1.8,mean],y,'--',color = 'g')
#tag alpha
[L,U]=stats.t.interval(.95,N-1,loc=3.040, scale=sem)
x = np.full(2,L)
plt.plot(x,[-0.15, 0.05],label = 'Rejection area based on alpha = 0.05', color = 'g')
y = np.full(2,0.05)
plt.plot([1.8,L],y,color = 'g')
plt.grid()
plt.legend(loc='upper right')

```
Out[14]:
```

```
In [ ]:
```