In [3]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Review

Before we start let's do some quick review from last time.

Let's create a new numpy arrray. Fill it with any values you want.
Let's say I want to calculate the average between the 1st and 2nd values, 2nd and 3rd value, and so on. How should I do it?

Creating a new numpy arrray



In [4]:

    
k = np.array([1,3,5,5,6,6,7,10,23123123,31232]) # create a new array
k









    Out[4]:





array([       1,        3,        5,        5,        6,        6,
              7,       10, 23123123,    31232])



In [161]:

    
k[1]









    Out[161]:





3



In [162]:

    
k[0]









    Out[162]:





1



In [163]:

    
k[4]









    Out[163]:





6



In [171]:

    
k[10] # even though the list has 9 items, the index actually goes from 0 to 9









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-171-c20ee5804602> in <module>()
----> 1 k[10] # even though the list has 9 items, the index actually goes from 0 to 9

IndexError: index 10 is out of bounds for axis 0 with size 10



In [165]:

    
k[9]









    Out[165]:





31232

So what happens if you want to access the last value but don't know how many items are there in the array?



In [172]:

    
k.size # the length of the list/vector is still and always be 10









    Out[172]:





10



In [173]:

    
k[k.size] # again, index starts at 0!









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-173-860b0d4aae17> in <module>()
----> 1 k[k.size] # again, index starts at 0!

IndexError: index 10 is out of bounds for axis 0 with size 10



In [174]:

    
k[k.size-1] # yay









    Out[174]:





31232

Alternatively, we can also write k[-1] to access the last value.



In [170]:

    
k[-1]









    Out[170]:





31232



In [176]:

    
k[-3]









    Out[176]:





10



In [159]:

    
print(k[3:])
print(k[:-4])
print(k[1:])
print(k[:-1])









    



[       5        6        6        7       10 23123123    31232]
[1 3 5 5 6 6]
[       3        5        5        6        6        7       10 23123123
    31232]
[       1        3        5        5        6        6        7       10
 23123123]

Geometric Mean (Average)

Let's say I want to calculate the average between the 1st and 2nd values, 2nd and 3rd value, and so on. How should I do it?



In [153]:

    
print(k)
print(k[1:])
print(k[:-1])









    



[       1        3        5        5        6        6        7       10
 23123123    31232]
[       3        5        5        6        6        7       10 23123123
    31232]
[       1        3        5        5        6        6        7       10
 23123123]



In [135]:

    
average_k = (k[1:]+k[:-1])/2
average









    Out[135]:





array([  2.00000000e+00,   4.00000000e+00,   5.00000000e+00,
         5.50000000e+00,   6.00000000e+00,   6.50000000e+00,
         8.50000000e+00,   1.15615665e+07,   1.15771775e+07])

To calculate the average of the whole vector, of course, add all the numbers up and divide by the number of values.



In [149]:

    
k_sum = np.sum(k) # add all numbers up
k.size
#k_sum/k.size # divide by total number of values









    Out[149]:





10



In [142]:

    
np.mean(k) # verify with numpy's own function









    Out[142]:





2315439.7999999998

Harmonic Mean

Let's say I want to harmonic mean between the 1st and 2nd values, 2nd and 3rd value, and so on. The formula is:

$$H = 2 (\frac{1}{k_i} + \frac{1}{k_{i+1}})^{-1}$$

And for the whole vector, it would be . . .



In [139]:

    
harmonic_mean_k = 2*(1/k[:-1] + 1/k[1:])**-1
harmonic_mean_k









    Out[139]:





array([  1.50000000e+00,   3.75000000e+00,   5.00000000e+00,
         5.45454545e+00,   6.00000000e+00,   6.46153846e+00,
         8.23529412e+00,   1.99999914e+01,   6.23797448e+04])



In [14]:

    
print("ki")
print(k)
print()
print("1/ki")
print(1/k)
print()
print("Sum of all 1/ki")
print(sum(1/k))
print()
print("size/ (Sum of all 1/ki)")
k.size / np.sum(1.0/k)









    



ki
[       1        3        5        5        6        6        7       10
 23123123    31232]

1/ki
[  1.00000000e+00   3.33333333e-01   2.00000000e-01   2.00000000e-01
   1.66666667e-01   1.66666667e-01   1.42857143e-01   1.00000000e-01
   4.32467535e-08   3.20184426e-05]

Sum of all 1/ki
2.30955587121

size/ (Sum of all 1/ki)






    Out[14]:





4.3298367987725292

Plotting sin(x) vs. x



In [184]:

    
xs = np.linspace(0, 2*np.pi, 100)
print(xs)
ys = np.sin(xs) # np.sin is a universal function
print(ys)
plt.plot(xs, ys);









    



[ 0.          0.06346652  0.12693304  0.19039955  0.25386607  0.31733259
  0.38079911  0.44426563  0.50773215  0.57119866  0.63466518  0.6981317
  0.76159822  0.82506474  0.88853126  0.95199777  1.01546429  1.07893081
  1.14239733  1.20586385  1.26933037  1.33279688  1.3962634   1.45972992
  1.52319644  1.58666296  1.65012947  1.71359599  1.77706251  1.84052903
  1.90399555  1.96746207  2.03092858  2.0943951   2.15786162  2.22132814
  2.28479466  2.34826118  2.41172769  2.47519421  2.53866073  2.60212725
  2.66559377  2.72906028  2.7925268   2.85599332  2.91945984  2.98292636
  3.04639288  3.10985939  3.17332591  3.23679243  3.30025895  3.36372547
  3.42719199  3.4906585   3.55412502  3.61759154  3.68105806  3.74452458
  3.8079911   3.87145761  3.93492413  3.99839065  4.06185717  4.12532369
  4.1887902   4.25225672  4.31572324  4.37918976  4.44265628  4.5061228
  4.56958931  4.63305583  4.69652235  4.75998887  4.82345539  4.88692191
  4.95038842  5.01385494  5.07732146  5.14078798  5.2042545   5.26772102
  5.33118753  5.39465405  5.45812057  5.52158709  5.58505361  5.64852012
  5.71198664  5.77545316  5.83891968  5.9023862   5.96585272  6.02931923
  6.09278575  6.15625227  6.21971879  6.28318531]
[  0.00000000e+00   6.34239197e-02   1.26592454e-01   1.89251244e-01
   2.51147987e-01   3.12033446e-01   3.71662456e-01   4.29794912e-01
   4.86196736e-01   5.40640817e-01   5.92907929e-01   6.42787610e-01
   6.90079011e-01   7.34591709e-01   7.76146464e-01   8.14575952e-01
   8.49725430e-01   8.81453363e-01   9.09631995e-01   9.34147860e-01
   9.54902241e-01   9.71811568e-01   9.84807753e-01   9.93838464e-01
   9.98867339e-01   9.99874128e-01   9.96854776e-01   9.89821442e-01
   9.78802446e-01   9.63842159e-01   9.45000819e-01   9.22354294e-01
   8.95993774e-01   8.66025404e-01   8.32569855e-01   7.95761841e-01
   7.55749574e-01   7.12694171e-01   6.66769001e-01   6.18158986e-01
   5.67059864e-01   5.13677392e-01   4.58226522e-01   4.00930535e-01
   3.42020143e-01   2.81732557e-01   2.20310533e-01   1.58001396e-01
   9.50560433e-02   3.17279335e-02  -3.17279335e-02  -9.50560433e-02
  -1.58001396e-01  -2.20310533e-01  -2.81732557e-01  -3.42020143e-01
  -4.00930535e-01  -4.58226522e-01  -5.13677392e-01  -5.67059864e-01
  -6.18158986e-01  -6.66769001e-01  -7.12694171e-01  -7.55749574e-01
  -7.95761841e-01  -8.32569855e-01  -8.66025404e-01  -8.95993774e-01
  -9.22354294e-01  -9.45000819e-01  -9.63842159e-01  -9.78802446e-01
  -9.89821442e-01  -9.96854776e-01  -9.99874128e-01  -9.98867339e-01
  -9.93838464e-01  -9.84807753e-01  -9.71811568e-01  -9.54902241e-01
  -9.34147860e-01  -9.09631995e-01  -8.81453363e-01  -8.49725430e-01
  -8.14575952e-01  -7.76146464e-01  -7.34591709e-01  -6.90079011e-01
  -6.42787610e-01  -5.92907929e-01  -5.40640817e-01  -4.86196736e-01
  -4.29794912e-01  -3.71662456e-01  -3.12033446e-01  -2.51147987e-01
  -1.89251244e-01  -1.26592454e-01  -6.34239197e-02  -2.44929360e-16]



In [180]:

    
xs = np.arange(10)
print (xs)
print (-xs)
print (xs+xs)
print (xs*xs)
print (xs**3)
print (xs < 5)









    



[0 1 2 3 4 5 6 7 8 9]
[ 0 -1 -2 -3 -4 -5 -6 -7 -8 -9]
[ 0  2  4  6  8 10 12 14 16 18]
[ 0  1  4  9 16 25 36 49 64 81]
[  0   1   8  27  64 125 216 343 512 729]
[ True  True  True  True  True False False False False False]

Let's read the GASISData.csv file



In [14]:

    
data = pd.read_csv(r'C:\Users\jenng\Documents\texaspse-blog\media\f16-scientific-python\week2\GASISData.csv')









    



C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (23,24,27,28,29,51,52,61,126,135,136,141,142,143,145,146) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)



In [63]:

    
#Show a preview of the GASIS data
data.head()









    Out[63]:






  
    
      
      LINKA
      STATE
      FLDNAME
      RESNAME
      R_STUDY
      PLAYNAME
      PLAYCOD
      SUBPLAYN
      SUBPLAY
      DOEFLD
      ...
      HELIUM
      OTHER
      HEAT
      ATLASREG
      RTYPEGOR
      USGSPROV
      USGSPLAY
      S_USGSPL
      PLAYDEPO
      TYPELOGF
    
  
  
    
      0
      1
      ALABAMA
      ALOE BAY
      NORPHLET
      A
      UPPER JURASSIC AGGRADATIONAL SANDSTONE
      UU A.1
      NaN
      NaN
      14081.0
      ...
      0.00
      0.0
      0
      GM
      G
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1
      2
      ALABAMA
      BEAVERTON
      CARTER NE
      R
      PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...
      GCPN-MS-1
      NaN
      NaN
      50317.0
      ...
      0.07
      0.0
      1005
      EG
      G
      65.0
      6502.0
      O
      DELTA
      AL000001
    
    
      2
      3
      ALABAMA
      BEAVERTON
      CARTER SE
      R
      PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...
      GCPN-MS-1
      NaN
      NaN
      50317.0
      ...
      0.07
      0.0
      1005
      EG
      G
      65.0
      6502.0
      O
      DELTA
      AL000002
    
    
      3
      4
      ALABAMA
      BEAVERTON
      LEWIS
      R
      PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...
      GCPN-MS-1
      NaN
      NaN
      50317.0
      ...
      0.00
      0.0
      0
      EG
      G
      65.0
      6502.0
      O
      DELTA
      AL000003
    
    
      4
      5
      ALABAMA
      BETHEL CHURCH
      CARTER
      R
      PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...
      GCPN-MS-1
      NaN
      NaN
      1994.0
      ...
      0.08
      0.5
      1004
      EG
      G
      65.0
      6502.0
      O
      DELTA
      AL000004
    
  

5 rows × 186 columns



In [18]:

    
# It's not required but let's see how large this file is
row, col = data.shape 
# data.shape will result in a (19220, 186). 
# We are just recasting these values to equal row and col respectively
print("Number of Rows: {}".format(row))
print("Number of Column: {}".format(col))









    



Number of Rows: 19220
Number of Column: 186

Creating a Cumulative Distributive Function (CDF)

If you do some statistics work, you will know there's a thing called Cumulative Distribution Function. Which basically is counting what's the chance that a random number you pick within the data set is going to be less than the current number.

1st way of plotting the CDF for Permeability [k]

Use the ['AVPERM'] from the data set for this example.
Remove all zeros within the column since it's a fluff value.
Sort the column from smallest to largest
Create a another column called Frequency [%] that calculates the row number by the total size of the column. For example, the first sorted value will be 1/size, second sorted value will be 2/size, and so on.
Plot AVGPERM vs. the Frequency.
Again, Plot AVGPERM vs. the Frequency but using log scale.



In [85]:

    
# drop all 0s
#avg_permeability = avg_permeability[avg_permeability !=0] # drop all values that are 0 because it's fluff
#avg_permeability = avg_permeability.sort_values() # sort permeability from smallest to largest
#n = np.linspace(1,avg_permeability.size,avg_permeability.size) # create a new thing that goes from 1, 2, ... size of the column
#frequency = n/avg_permeability.size # calculate the



In [88]:

    
avg_k = data['AVPERM'] # assign a single dataframe to this variable avg_k. At this point, avg_k will be of type pandas.Series
avg_k = avg_k[avg_k != 0] # drop all values that are 0
cdf_df = avg_k.to_frame() # convert series to a dataframe, assign it to variable cdf_df
cdf_df = cdf_df.sort_values(by='AVPERM') # sort dataset from smallest to largest by column AVPERM
total_num_of_values = avg_k.size # find size of column, assign it to variable total_num_of_values
# create a new column called Count that goes from 1,2..total_num_of_values
cdf_df['Count'] = np.linspace(1,total_num_of_values,total_num_of_values) 
# create a new column called Frequency that divides Count by total num of val
cdf_df['Frequency'] = cdf_df.Count/total_num_of_values 
print(cdf_df)









    



        AVPERM   Count  Frequency
16837     0.01     1.0   0.000496
18901     0.01     2.0   0.000992
13577     0.01     3.0   0.001487
17110     0.01     4.0   0.001983
13925     0.01     5.0   0.002479
12922     0.01     6.0   0.002975
12894     0.01     7.0   0.003471
17513     0.01     8.0   0.003966
9998      0.02     9.0   0.004462
19151     0.02    10.0   0.004958
941       0.02    11.0   0.005454
19043     0.02    12.0   0.005949
19178     0.02    13.0   0.006445
17625     0.02    14.0   0.006941
16799     0.03    15.0   0.007437
19047     0.03    16.0   0.007933
8448      0.04    17.0   0.008428
11107     0.04    18.0   0.008924
18670     0.05    19.0   0.009420
18627     0.05    20.0   0.009916
18304     0.05    21.0   0.010412
18108     0.05    22.0   0.010907
18688     0.05    23.0   0.011403
12712     0.05    24.0   0.011899
14701     0.05    25.0   0.012395
7540      0.06    26.0   0.012890
18005     0.06    27.0   0.013386
12139     0.07    28.0   0.013882
18948     0.08    29.0   0.014378
19073     0.09    30.0   0.014874
...        ...     ...        ...
17712  2200.00  1988.0   0.985622
1380   2200.00  1989.0   0.986118
14652  2260.00  1990.0   0.986614
23     2289.00  1991.0   0.987110
17711  2290.00  1992.0   0.987605
17995  2300.00  1993.0   0.988101
9978   2300.00  1994.0   0.988597
17757  2344.00  1995.0   0.989093
17759  2344.00  1996.0   0.989588
17761  2344.00  1997.0   0.990084
17929  2350.00  1998.0   0.990580
16718  2400.00  1999.0   0.991076
1085   2400.00  2000.0   0.991572
1053   2440.00  2001.0   0.992067
9967   2500.00  2002.0   0.992563
10009  2500.00  2003.0   0.993059
12986  2500.00  2004.0   0.993555
3212   2565.00  2005.0   0.994051
12908  2575.00  2006.0   0.994546
15592  2800.00  2007.0   0.995042
3808   2810.00  2008.0   0.995538
10008  3080.00  2009.0   0.996034
15935  3100.00  2010.0   0.996529
17731  3200.00  2011.0   0.997025
15962  3333.00  2012.0   0.997521
12917  3550.00  2013.0   0.998017
17179  4500.00  2014.0   0.998513
15931  4861.00  2015.0   0.999008
15784  5085.00  2016.0   0.999504
15989  8200.00  2017.0   1.000000

[2017 rows x 3 columns]



In [89]:

    
# plot
plt.plot(cdf_df.AVPERM, cdf_df.Frequency,label="CDF")
plt.scatter(cdf_df.AVPERM, cdf_df.Frequency,label="Data")
plt.xlabel("Avg")
plt.ylabel("Frequency")
plt.title("Plot")
plt.legend()
plt.show()



In [90]:

    
# plot
plt.semilogx(cdf_df.AVPERM, cdf_df.Frequency,label="CDF",color='purple',marker=".")
plt.xlabel("Avg")
plt.ylabel("Frequency")
plt.title("Plot")
plt.legend()
plt.show()

2nd Way for plotting a Cumulative Distributive Function for Permeability

Use the ['AVPERM'] from the data set for this example.
Remove all zeros within the column since it's a fluff value.
Create a new Count column. In this column, find how many other permeability values in the column is less than the current value.
Create a new Frequency [%] column that is Count/total size of whole column.

Take Home Problem:

Try to create the CDF for Porosity ['AVPOR'] using the first method.
Try to create the CDF for Permeability and Porosity using the second method.

Creating a Histogram

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html



In [116]:

    
avg_p = data['AVPOR'] # assign a single dataframe to this variable avg_k. At this point, avg_k will be of type pandas.Series
avg_p = avg_p[avg_p != 0]
pdf_df = avg_p.to_frame()
pdf_df.head()



In [118]:

    
pdf_df.hist(column='AVPOR',bins=10)









    Out[118]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000210387CAB38>]], dtype=object)



In [119]:

    
pdf_df.hist(column='AVPOR',bins=100)









    Out[119]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000210387C7F98>]], dtype=object)



In [125]:

    
# let's try to see what would happen if we change the x-axis to log scale
# There's really no reason why we are doing this
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_yscale('log')



In [123]:

    
#let's try to see what would happen if we change the y-axis to log scale
fig, ax = plt.subplots()
pdf_df.hist(ax=ax,column='AVPOR',bins=100)
ax.set_xscale('log')

Creating a Probability Density Function (PDF)

A probability density function (PDF), or density of a continuous random variable, is a function that describes the relative likelihood for this random variable to take on a given value.

You will probably see this quite a bit. The plot belows show the box plot and the probability density function of a normal distribution, or a Gaussian distribution.

As you can see, the big difference between the histogram we saw earlier and this plot is that the histogram is broken up by chunks, while this plot is more continuous.

The histogram is represented by bar charts while PDF is traditionally represented with a smooth line.

Oh! And the area under a PDF curve is equal to 1.



In [127]:

    
pdf_df.plot.density()









    Out[127]:





<matplotlib.axes._subplots.AxesSubplot at 0x210383664e0>

	LINKA	STATE	FLDNAME	RESNAME	R_STUDY	PLAYNAME	PLAYCOD	SUBPLAYN	SUBPLAY	DOEFLD	...	HELIUM	OTHER	HEAT	ATLASREG	RTYPEGOR	USGSPROV	USGSPLAY	S_USGSPL	PLAYDEPO	TYPELOGF
0	1	ALABAMA	ALOE BAY	NORPHLET	A	UPPER JURASSIC AGGRADATIONAL SANDSTONE	UU A.1	NaN	NaN	14081.0	...	0.00	0.0	0	GM	G	NaN	NaN	NaN	NaN	NaN
1	2	ALABAMA	BEAVERTON	CARTER NE	R	PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...	GCPN-MS-1	NaN	NaN	50317.0	...	0.07	0.0	1005	EG	G	65.0	6502.0	O	DELTA	AL000001
2	3	ALABAMA	BEAVERTON	CARTER SE	R	PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...	GCPN-MS-1	NaN	NaN	50317.0	...	0.07	0.0	1005	EG	G	65.0	6502.0	O	DELTA	AL000002
3	4	ALABAMA	BEAVERTON	LEWIS	R	PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...	GCPN-MS-1	NaN	NaN	50317.0	...	0.00	0.0	0	EG	G	65.0	6502.0	O	DELTA	AL000003
4	5	ALABAMA	BETHEL CHURCH	CARTER	R	PENNSYLVANIAN-MISSISSIPPIAN DELTAIC AND MARINE...	GCPN-MS-1	NaN	NaN	1994.0	...	0.08	0.5	1004	EG	G	65.0	6502.0	O	DELTA	AL000004