In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Review

Before we start let's do some quick review from last time.

  1. Let's create a new numpy arrray. Fill it with any values you want.
  2. Let's say I want to calculate the average between the 1st and 2nd values, 2nd and 3rd value, and so on. How should I do it?

Creating a new numpy arrray


In [4]:
k = np.array([1,3,5,5,6,6,7,10,23123123,31232]) # create a new array
k


Out[4]:
array([       1,        3,        5,        5,        6,        6,
              7,       10, 23123123,    31232])

In [161]:
k[1]


Out[161]:
3

In [162]:
k[0]


Out[162]:
1

In [163]:
k[4]


Out[163]:
6

In [171]:
k[10] # even though the list has 9 items, the index actually goes from 0 to 9


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-171-c20ee5804602> in <module>()
----> 1 k[10] # even though the list has 9 items, the index actually goes from 0 to 9

IndexError: index 10 is out of bounds for axis 0 with size 10

In [165]:
k[9]


Out[165]:
31232

So what happens if you want to access the last value but don't know how many items are there in the array?


In [172]:
k.size # the length of the list/vector is still and always be 10


Out[172]:
10

In [173]:
k[k.size] # again, index starts at 0!


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-173-860b0d4aae17> in <module>()
----> 1 k[k.size] # again, index starts at 0!

IndexError: index 10 is out of bounds for axis 0 with size 10

In [174]:
k[k.size-1] # yay


Out[174]:
31232

Alternatively, we can also write k[-1] to access the last value.


In [170]:
k[-1]


Out[170]:
31232

In [176]:
k[-3]


Out[176]:
10

In [159]:
print(k[3:])
print(k[:-4])
print(k[1:])
print(k[:-1])


[       5        6        6        7       10 23123123    31232]
[1 3 5 5 6 6]
[       3        5        5        6        6        7       10 23123123
    31232]
[       1        3        5        5        6        6        7       10
 23123123]

Geometric Mean (Average)

Let's say I want to calculate the average between the 1st and 2nd values, 2nd and 3rd value, and so on. How should I do it?


In [153]:
print(k)
print(k[1:])
print(k[:-1])


[       1        3        5        5        6        6        7       10
 23123123    31232]
[       3        5        5        6        6        7       10 23123123
    31232]
[       1        3        5        5        6        6        7       10
 23123123]

In [135]:
average_k = (k[1:]+k[:-1])/2
average


Out[135]:
array([  2.00000000e+00,   4.00000000e+00,   5.00000000e+00,
         5.50000000e+00,   6.00000000e+00,   6.50000000e+00,
         8.50000000e+00,   1.15615665e+07,   1.15771775e+07])

To calculate the average of the whole vector, of course, add all the numbers up and divide by the number of values.


In [149]:
k_sum = np.sum(k) # add all numbers up
k.size
#k_sum/k.size # divide by total number of values


Out[149]:
10

In [142]:
np.mean(k) # verify with numpy's own function


Out[142]:
2315439.7999999998

Harmonic Mean

Let's say I want to harmonic mean between the 1st and 2nd values, 2nd and 3rd value, and so on. The formula is:

$$H = 2 (\frac{1}{k_i} + \frac{1}{k_{i+1}})^{-1}$$

And for the whole vector, it would be . . .


In [139]:
harmonic_mean_k = 2*(1/k[:-1] + 1/k[1:])**-1
harmonic_mean_k


Out[139]:
array([  1.50000000e+00,   3.75000000e+00,   5.00000000e+00,
         5.45454545e+00,   6.00000000e+00,   6.46153846e+00,
         8.23529412e+00,   1.99999914e+01,   6.23797448e+04])

In [14]:
print("ki")
print(k)
print()
print("1/ki")
print(1/k)
print()
print("Sum of all 1/ki")
print(sum(1/k))
print()
print("size/ (Sum of all 1/ki)")
k.size / np.sum(1.0/k)


ki
[       1        3        5        5        6        6        7       10
 23123123    31232]

1/ki
[  1.00000000e+00   3.33333333e-01   2.00000000e-01   2.00000000e-01
   1.66666667e-01   1.66666667e-01   1.42857143e-01   1.00000000e-01
   4.32467535e-08   3.20184426e-05]

Sum of all 1/ki
2.30955587121

size/ (Sum of all 1/ki)
Out[14]:
4.3298367987725292

Plotting sin(x) vs. x


In [184]:
xs = np.linspace(0, 2*np.pi, 100)
print(xs)
ys = np.sin(xs) # np.sin is a universal function
print(ys)
plt.plot(xs, ys);


[ 0.          0.06346652  0.12693304  0.19039955  0.25386607  0.31733259
  0.38079911  0.44426563  0.50773215  0.57119866  0.63466518  0.6981317
  0.76159822  0.82506474  0.88853126  0.95199777  1.01546429  1.07893081
  1.14239733  1.20586385  1.26933037  1.33279688  1.3962634   1.45972992
  1.52319644  1.58666296  1.65012947  1.71359599  1.77706251  1.84052903
  1.90399555  1.96746207  2.03092858  2.0943951   2.15786162  2.22132814
  2.28479466  2.34826118  2.41172769  2.47519421  2.53866073  2.60212725
  2.66559377  2.72906028  2.7925268   2.85599332  2.91945984  2.98292636
  3.04639288  3.10985939  3.17332591  3.23679243  3.30025895  3.36372547
  3.42719199  3.4906585   3.55412502  3.61759154  3.68105806  3.74452458
  3.8079911   3.87145761  3.93492413  3.99839065  4.06185717  4.12532369
  4.1887902   4.25225672  4.31572324  4.37918976  4.44265628  4.5061228
  4.56958931  4.63305583  4.69652235  4.75998887  4.82345539  4.88692191
  4.95038842  5.01385494  5.07732146  5.14078798  5.2042545   5.26772102
  5.33118753  5.39465405  5.45812057  5.52158709  5.58505361  5.64852012
  5.71198664  5.77545316  5.83891968  5.9023862   5.96585272  6.02931923
  6.09278575  6.15625227  6.21971879  6.28318531]
[  0.00000000e+00   6.34239197e-02   1.26592454e-01   1.89251244e-01
   2.51147987e-01   3.12033446e-01   3.71662456e-01   4.29794912e-01
   4.86196736e-01   5.40640817e-01   5.92907929e-01   6.42787610e-01
   6.90079011e-01   7.34591709e-01   7.76146464e-01   8.14575952e-01
   8.49725430e-01   8.81453363e-01   9.09631995e-01   9.34147860e-01
   9.54902241e-01   9.71811568e-01   9.84807753e-01   9.93838464e-01
   9.98867339e-01   9.99874128e-01   9.96854776e-01   9.89821442e-01
   9.78802446e-01   9.63842159e-01   9.45000819e-01   9.22354294e-01
   8.95993774e-01   8.66025404e-01   8.32569855e-01   7.95761841e-01
   7.55749574e-01   7.12694171e-01   6.66769001e-01   6.18158986e-01
   5.67059864e-01   5.13677392e-01   4.58226522e-01   4.00930535e-01
   3.42020143e-01   2.81732557e-01   2.20310533e-01   1.58001396e-01
   9.50560433e-02   3.17279335e-02  -3.17279335e-02  -9.50560433e-02
  -1.58001396e-01  -2.20310533e-01  -2.81732557e-01  -3.42020143e-01
  -4.00930535e-01  -4.58226522e-01  -5.13677392e-01  -5.67059864e-01
  -6.18158986e-01  -6.66769001e-01  -7.12694171e-01  -7.55749574e-01
  -7.95761841e-01  -8.32569855e-01  -8.66025404e-01  -8.95993774e-01
  -9.22354294e-01  -9.45000819e-01  -9.63842159e-01  -9.78802446e-01
  -9.89821442e-01  -9.96854776e-01  -9.99874128e-01  -9.98867339e-01
  -9.93838464e-01  -9.84807753e-01  -9.71811568e-01  -9.54902241e-01
  -9.34147860e-01  -9.09631995e-01  -8.81453363e-01  -8.49725430e-01
  -8.14575952e-01  -7.76146464e-01  -7.34591709e-01  -6.90079011e-01
  -6.42787610e-01  -5.92907929e-01  -5.40640817e-01  -4.86196736e-01
  -4.29794912e-01  -3.71662456e-01  -3.12033446e-01  -2.51147987e-01
  -1.89251244e-01  -1.26592454e-01  -6.34239197e-02  -2.44929360e-16]

In [180]:
xs = np.arange(10)
print (xs)
print (-xs)
print (xs+xs)
print (xs*xs)
print (xs**3)
print (xs < 5)


[0 1 2 3 4 5 6 7 8 9]
[ 0 -1 -2 -3 -4 -5 -6 -7 -8 -9]
[ 0  2  4  6  8 10 12 14 16 18]
[ 0  1  4  9 16 25 36 49 64 81]
[  0   1   8  27  64 125 216 343 512 729]
[ True  True  True  True  True False False False False False]

Let's read the GASISData.csv file


In [14]:
data = pd.read_csv(r'C:\Users\jenng\Documents\texaspse-blog\media\f16-scientific-python\week2\GASISData.csv')


C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (23,24,27,28,29,51,52,61,126,135,136,141,142,143,145,146) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [63]:
#Show a preview of the GASIS data
data.head()


Out[63]:
LINKA STATE FLDNAME RESNAME R_STUDY PLAYNAME PLAYCOD SUBPLAYN SUBPLAY DOEFLD ... HELIUM OTHER HEAT ATLASREG RTYPEGOR USGSPROV USGSPLAY S_USGSPL PLAYDEPO TYPELOGF
0 1 ALABAMA ALOE BAY NORPHLET A UPPER JURASSIC AGGRADATIONAL SANDSTONE UU A.1 NaN NaN 14081.0 ... 0.00 0.0 0 GM G NaN NaN NaN NaN NaN
1 2 ALABAMA BEAVERTON CARTER NE R PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE... GCPN-MS-1 NaN NaN 50317.0 ... 0.07 0.0 1005 EG G 65.0 6502.0 O DELTA AL000001
2 3 ALABAMA BEAVERTON CARTER SE R PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE... GCPN-MS-1 NaN NaN 50317.0 ... 0.07 0.0 1005 EG G 65.0 6502.0 O DELTA AL000002
3 4 ALABAMA BEAVERTON LEWIS R PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE... GCPN-MS-1 NaN NaN 50317.0 ... 0.00 0.0 0 EG G 65.0 6502.0 O DELTA AL000003
4 5 ALABAMA BETHEL CHURCH CARTER R PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE... GCPN-MS-1 NaN NaN 1994.0 ... 0.08 0.5 1004 EG G 65.0 6502.0 O DELTA AL000004

5 rows × 186 columns


In [18]:
# It's not required but let's see how large this file is
row, col = data.shape 
# data.shape will result in a (19220, 186). 
# We are just recasting these values to equal row and col respectively
print("Number of Rows: {}".format(row))
print("Number of Column: {}".format(col))


Number of Rows: 19220
Number of Column: 186

Creating a Cumulative Distributive Function (CDF)

If you do some statistics work, you will know there's a thing called Cumulative Distribution Function. Which basically is counting what's the chance that a random number you pick within the data set is going to be less than the current number.

1st way of plotting the CDF for Permeability [k]

  1. Use the ['AVPERM'] from the data set for this example.
  2. Remove all zeros within the column since it's a fluff value.
  3. Sort the column from smallest to largest
  4. Create a another column called Frequency [%] that calculates the row number by the total size of the column. For example, the first sorted value will be 1/size, second sorted value will be 2/size, and so on.
  5. Plot AVGPERM vs. the Frequency.
  6. Again, Plot AVGPERM vs. the Frequency but using log scale.

In [85]:
# drop all 0s
#avg_permeability = avg_permeability[avg_permeability !=0] # drop all values that are 0 because it's fluff
#avg_permeability = avg_permeability.sort_values() # sort permeability from smallest to largest
#n = np.linspace(1,avg_permeability.size,avg_permeability.size) # create a new thing that goes from 1, 2, ... size of the column
#frequency = n/avg_permeability.size # calculate the

In [88]:
avg_k = data['AVPERM'] # assign a single dataframe to this variable avg_k. At this point, avg_k will be of type pandas.Series
avg_k = avg_k[avg_k != 0] # drop all values that are 0
cdf_df = avg_k.to_frame() # convert series to a dataframe, assign it to variable cdf_df
cdf_df = cdf_df.sort_values(by='AVPERM') # sort dataset from smallest to largest by column AVPERM
total_num_of_values = avg_k.size # find size of column, assign it to variable total_num_of_values
# create a new column called Count that goes from 1,2..total_num_of_values
cdf_df['Count'] = np.linspace(1,total_num_of_values,total_num_of_values) 
# create a new column called Frequency that divides Count by total num of val
cdf_df['Frequency'] = cdf_df.Count/total_num_of_values 
print(cdf_df)


        AVPERM   Count  Frequency
16837     0.01     1.0   0.000496
18901     0.01     2.0   0.000992
13577     0.01     3.0   0.001487
17110     0.01     4.0   0.001983
13925     0.01     5.0   0.002479
12922     0.01     6.0   0.002975
12894     0.01     7.0   0.003471
17513     0.01     8.0   0.003966
9998      0.02     9.0   0.004462
19151     0.02    10.0   0.004958
941       0.02    11.0   0.005454
19043     0.02    12.0   0.005949
19178     0.02    13.0   0.006445
17625     0.02    14.0   0.006941
16799     0.03    15.0   0.007437
19047     0.03    16.0   0.007933
8448      0.04    17.0   0.008428
11107     0.04    18.0   0.008924
18670     0.05    19.0   0.009420
18627     0.05    20.0   0.009916
18304     0.05    21.0   0.010412
18108     0.05    22.0   0.010907
18688     0.05    23.0   0.011403
12712     0.05    24.0   0.011899
14701     0.05    25.0   0.012395
7540      0.06    26.0   0.012890
18005     0.06    27.0   0.013386
12139     0.07    28.0   0.013882
18948     0.08    29.0   0.014378
19073     0.09    30.0   0.014874
...        ...     ...        ...
17712  2200.00  1988.0   0.985622
1380   2200.00  1989.0   0.986118
14652  2260.00  1990.0   0.986614
23     2289.00  1991.0   0.987110
17711  2290.00  1992.0   0.987605
17995  2300.00  1993.0   0.988101
9978   2300.00  1994.0   0.988597
17757  2344.00  1995.0   0.989093
17759  2344.00  1996.0   0.989588
17761  2344.00  1997.0   0.990084
17929  2350.00  1998.0   0.990580
16718  2400.00  1999.0   0.991076
1085   2400.00  2000.0   0.991572
1053   2440.00  2001.0   0.992067
9967   2500.00  2002.0   0.992563
10009  2500.00  2003.0   0.993059
12986  2500.00  2004.0   0.993555
3212   2565.00  2005.0   0.994051
12908  2575.00  2006.0   0.994546
15592  2800.00  2007.0   0.995042
3808   2810.00  2008.0   0.995538
10008  3080.00  2009.0   0.996034
15935  3100.00  2010.0   0.996529
17731  3200.00  2011.0   0.997025
15962  3333.00  2012.0   0.997521
12917  3550.00  2013.0   0.998017
17179  4500.00  2014.0   0.998513
15931  4861.00  2015.0   0.999008
15784  5085.00  2016.0   0.999504
15989  8200.00  2017.0   1.000000

[2017 rows x 3 columns]

In [89]:
# plot
plt.plot(cdf_df.AVPERM, cdf_df.Frequency,label="CDF")
plt.scatter(cdf_df.AVPERM, cdf_df.Frequency,label="Data")
plt.xlabel("Avg")
plt.ylabel("Frequency")
plt.title("Plot")
plt.legend()
plt.show()



In [90]:
# plot
plt.semilogx(cdf_df.AVPERM, cdf_df.Frequency,label="CDF",color='purple',marker=".")
plt.xlabel("Avg")
plt.ylabel("Frequency")
plt.title("Plot")
plt.legend()
plt.show()


2nd Way for plotting a Cumulative Distributive Function for Permeability

  1. Use the ['AVPERM'] from the data set for this example.
  2. Remove all zeros within the column since it's a fluff value.
  3. Create a new Count column. In this column, find how many other permeability values in the column is less than the current value.
  4. Create a new Frequency [%] column that is Count/total size of whole column.

Take Home Problem:

  • Try to create the CDF for Porosity ['AVPOR'] using the first method.
  • Try to create the CDF for Permeability and Porosity using the second method.

In [116]:
avg_p = data['AVPOR'] # assign a single dataframe to this variable avg_k. At this point, avg_k will be of type pandas.Series
avg_p = avg_p[avg_p != 0]
pdf_df = avg_p.to_frame()
pdf_df.head()


Out[116]:
AVPOR
0 11.0
1 15.6
2 12.0
3 10.3
4 10.0

In [118]:
pdf_df.hist(column='AVPOR',bins=10)


Out[118]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000210387CAB38>]], dtype=object)

In [119]:
pdf_df.hist(column='AVPOR',bins=100)


Out[119]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000210387C7F98>]], dtype=object)

In [125]:
# let's try to see what would happen if we change the x-axis to log scale
# There's really no reason why we are doing this
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_yscale('log')



In [123]:
#let's try to see what would happen if we change the y-axis to log scale
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_xscale('log')


Creating a Probability Density Function (PDF)

A probability density function (PDF), or density of a continuous random variable, is a function that describes the relative likelihood for this random variable to take on a given value.

You will probably see this quite a bit. The plot belows show the box plot and the probability density function of a normal distribution, or a Gaussian distribution.

As you can see, the big difference between the histogram we saw earlier and this plot is that the histogram is broken up by chunks, while this plot is more continuous.

The histogram is represented by bar charts while PDF is traditionally represented with a smooth line.

Oh! And the area under a PDF curve is equal to 1.


In [127]:
pdf_df.plot.density()


Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x210383664e0>