In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [4]:
k = np.array([1,3,5,5,6,6,7,10,23123123,31232]) # create a new array
k
Out[4]:
In [161]:
k[1]
Out[161]:
In [162]:
k[0]
Out[162]:
In [163]:
k[4]
Out[163]:
In [171]:
k[10] # the array has 10 items, but valid indices run from 0 to 9, so this raises an IndexError
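Indeed, running that line raises an IndexError. If you ever want to handle an out-of-range index gracefully, you can catch the exception; a small sketch, not part of the original notebook:
try:
    k[10]
except IndexError as err:
    print(err)  # something like: index 10 is out of bounds for axis 0 with size 10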
In [165]:
k[9]
Out[165]:
So what happens if you want to access the last value but don't know how many items the array holds?
In [172]:
k.size # the length of the array is, and will always be, 10
Out[172]:
In [173]:
k[k.size] # again, indexing starts at 0, so index 10 is out of bounds
In [174]:
k[k.size-1] # yay, the last value
Out[174]:
Alternatively, we can write k[-1] to access the last value.
In [170]:
k[-1]
Out[170]:
In [176]:
k[-3]
Out[176]:
In [159]:
print(k[3:])   # from index 3 to the end
print(k[:-4])  # everything except the last 4 values
print(k[1:])   # from index 1 to the end
print(k[:-1])  # everything except the last value
Let's say I want to calculate the average of the 1st and 2nd values, the 2nd and 3rd values, and so on. How should I do it?
In [153]:
print(k)       # the original array
print(k[1:])   # everything from the 2nd value on
print(k[:-1])  # everything up to (but not including) the last value
In [135]:
average_k = (k[1:]+k[:-1])/2 # elementwise average of each neighboring pair
average_k
Out[135]:
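As a side note, the same pairwise averages can also be produced with a two-point moving average via np.convolve; this is just an alternative sketch, not the notebook's method:
pairwise = np.convolve(k, [0.5, 0.5], mode='valid')  # 0.5*k[i] + 0.5*k[i+1] for each neighboring pair
print(pairwise)  # matches (k[1:] + k[:-1]) / 2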
To calculate the average of the whole vector, of course, add all the numbers up and divide by the number of values.
In [149]:
k_sum = np.sum(k) # add all the numbers up
k_sum/k.size # divide by the total number of values
Out[149]:
In [142]:
np.mean(k) # verify with numpy's own function
Out[142]:
Let's say I want the harmonic mean between the 1st and 2nd values, the 2nd and 3rd values, and so on. The formula is:
$$H = 2 \left(\frac{1}{k_i} + \frac{1}{k_{i+1}}\right)^{-1}$$And for the whole vector of $n$ values, it is
$$H = n \left(\sum_{i=1}^{n} \frac{1}{k_i}\right)^{-1}$$
In [139]:
harmonic_mean_k = 2*(1/k[:-1] + 1/k[1:])**-1
harmonic_mean_k
Out[139]:
In [14]:
print("ki")
print(k)
print()
print("1/ki")
print(1/k)
print()
print("Sum of all 1/ki")
print(sum(1/k))
print()
print("size/ (Sum of all 1/ki)")
k.size / np.sum(1.0/k)
Out[14]:
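If you have SciPy installed, its built-in harmonic mean makes a handy sanity check; a quick sketch assuming SciPy is available:
from scipy.stats import hmean
print(hmean(k))  # should match k.size / np.sum(1.0/k)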
In [184]:
xs = np.linspace(0, 2*np.pi, 100) # 100 evenly spaced points between 0 and 2*pi
print(xs)
ys = np.sin(xs) # np.sin is a universal function
print(ys)
plt.plot(xs, ys);
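np.sin was applied to all 100 points at once; other NumPy universal functions behave the same way. A quick sketch (not in the original notebook) overlaying another ufunc:
ys_cos = np.cos(xs)  # np.cos is another ufunc, applied elementwise like np.sin
plt.plot(xs, ys, label='sin')
plt.plot(xs, ys_cos, label='cos')
plt.legend();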
In [180]:
xs = np.arange(10)
print (xs)
print (-xs)
print (xs+xs)
print (xs*xs)
print (xs**3)
print (xs < 5)
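Each of those expressions acts on the whole array at once. For contrast, here is the explicit loop that xs*xs replaces; this is just an illustrative sketch of what vectorization saves you from writing:
squares = np.empty_like(xs)     # preallocate an array with the same shape and dtype
for i in range(xs.size):
    squares[i] = xs[i] * xs[i]  # one element at a time
print(squares)                  # same result as xs*xs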
Let's read the GASISData.csv file
In [14]:
data = pd.read_csv(r'C:\Users\jenng\Documents\texaspse-blog\media\f16-scientific-python\week2\GASISData.csv')
In [63]:
#Show a preview of the GASIS data
data.head()
Out[63]:
In [18]:
# It's not required, but let's see how large this file is
row, col = data.shape
# data.shape returns the tuple (19220, 186);
# we are just unpacking those values into row and col
print("Number of Rows: {}".format(row))
print("Number of Columns: {}".format(col))
If you've done some statistics work, you'll have run into the cumulative distribution function (CDF). For a given value, it tells you the probability that a number picked at random from the data set will be less than or equal to that value.
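Here is the idea on a tiny made-up array before we apply it to the real data (the numbers are purely illustrative):
tiny = np.sort(np.array([7, 2, 9, 4]))  # sort from smallest to largest -> [2 4 7 9]
ranks = np.arange(1, tiny.size + 1)     # 1, 2, ..., n
print(ranks / tiny.size)                # [0.25 0.5 0.75 1.] -- CDF value at each data point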
In [88]:
avg_k = data['AVPERM'] # select a single column; avg_k is of type pandas.Series
avg_k = avg_k[avg_k != 0] # drop all values that are 0
cdf_df = avg_k.to_frame() # convert series to a dataframe, assign it to variable cdf_df
cdf_df = cdf_df.sort_values(by='AVPERM') # sort dataset from smallest to largest by column AVPERM
total_num_of_values = avg_k.size # find size of column, assign it to variable total_num_of_values
# create a new column called Count that goes from 1,2..total_num_of_values
cdf_df['Count'] = np.arange(1, total_num_of_values + 1)
# create a new column called Frequency that divides Count by total num of val
cdf_df['Frequency'] = cdf_df.Count/total_num_of_values
print(cdf_df)
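For the record, pandas can produce the same frequencies in one line with Series.rank, although the step-by-step version above is easier to follow; an alternative sketch:
alt_frequency = avg_k.rank(method='first') / avg_k.size  # same values as cdf_df.Frequency, left in the original row order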
In [89]:
# plot the empirical CDF
plt.plot(cdf_df.AVPERM, cdf_df.Frequency, label="CDF")
plt.scatter(cdf_df.AVPERM, cdf_df.Frequency, label="Data")
plt.xlabel("Average permeability")
plt.ylabel("Cumulative frequency")
plt.title("CDF of average permeability")
plt.legend()
plt.show()
In [90]:
# same CDF with a logarithmic x-axis, which suits the wide spread of permeability values
plt.semilogx(cdf_df.AVPERM, cdf_df.Frequency, label="CDF", color='purple', marker=".")
plt.xlabel("Average permeability (log scale)")
plt.ylabel("Cumulative frequency")
plt.title("CDF of average permeability")
plt.legend()
plt.show()
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
In [116]:
avg_p = data['AVPOR'] # select the AVPOR column; avg_p is of type pandas.Series
avg_p = avg_p[avg_p != 0]
pdf_df = avg_p.to_frame()
pdf_df.head()
Out[116]:
In [118]:
pdf_df.hist(column='AVPOR',bins=10)
Out[118]:
In [119]:
pdf_df.hist(column='AVPOR',bins=100)
Out[119]:
In [125]:
# let's try to see what would happen if we change the y-axis to log scale
# there's no deep reason for this; we're just exploring
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_yscale('log')
In [123]:
# let's try to see what would happen if we change the x-axis to log scale
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_xscale('log')
A probability density function (PDF), or density of a continuous random variable, is a function that describes the relative likelihood for this random variable to take on a given value.
You will probably see this quite a bit. The plot below shows the box plot and the probability density function of a normal distribution, also called a Gaussian distribution.
As you can see, the big difference between the histogram we saw earlier and this plot is that the histogram is broken up into discrete bins, while this plot is continuous.
The histogram is drawn with bars, while a PDF is traditionally drawn as a smooth line.
Oh! And the area under a PDF curve is always equal to 1.
In [127]:
pdf_df.plot.density()
Out[127]:
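As a quick numerical check of that last claim, we can build the same kind of Gaussian kernel density estimate directly and integrate it. This is a sketch assuming SciPy is installed; pandas' plot.density uses scipy.stats.gaussian_kde under the hood:
from scipy.stats import gaussian_kde
vals = avg_p.dropna()                                       # gaussian_kde cannot handle NaNs
kde = gaussian_kde(vals)
grid = np.linspace(vals.min() - 10, vals.max() + 10, 1000)  # pad the grid to capture the tails
print(np.trapz(kde(grid), grid))                            # numerical integral; should be close to 1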