Harvard CS109 final project, submitted December 2015
Kamrine Poels, Eric Fredrickson, Thibaut Perol
Human-induced earthquakes are becoming an important topic in political and scientific discussions and are now widely reported by the media. It is well known that surface and underground mining, reservoir depletion, and the injection and withdrawal of fluids and gas from the subsurface can induce slip on preexisting faults and potentially trigger earthquakes of various magnitudes. Since 2009, new drilling technologies have given access to oil and gas in previously unproductive geological formations.
The recent increase of small to large earthquake events in the central and eastern United States has fueled concern that hydraulic fracturing could be responsible for the rise in the rate of earthquake activity. Since 2009, the oil industry has expanded its use of hydraulic fracturing because high oil prices have made the extraction of unconventional oil and gas economically viable. Microearthquakes (magnitude lower than 2 on the Richter magnitude scale, determined from the logarithm of the amplitude of waves recorded by seismographs) are routinely generated as part of the hydraulic fracturing (fracking) process used to stimulate oil and gas reservoirs. However, the currently practiced protocol has a low risk of inducing destructive earthquakes. Of the roughly 100,000 fracking wells drilled, the biggest earthquake recorded was of magnitude 3.6, which is too small to pose a serious safety risk.
Nevertheless, the correlation between the increase in fracking and in earthquake events has drawn considerable attention from both the scientific community and the media. Ellsworth [2013] claims that some of the seismicity is associated with the increase in disposal of the saltwater that 'flows back' after multistage fracturing operations (see the National Research Council report, Induced Seismicity Potential in Energy Technologies [2012]). Once the unconventional oil is extracted, along with the contaminated saltwater, the latter is reinjected into deeper sedimentary geological formations with high porosity and permeability via regulated class II underground injection control (UIC) wells. Sometimes the saline water is reinjected as part of water-flooding enhanced oil recovery. However, the large increase in earthquake activity is thought to be associated with the disposal wells.
Since 2009, Oklahoma has experienced a significant increase in earthquake events. The exponential increase in the number of earthquakes affects the full range of magnitudes, from the smallest to the largest. Furthermore, the number of saltwater disposal wells in this state has increased dramatically. Sometimes the sedimentary geological formation used to store the contaminated saltwater is hydraulically connected to the crystalline bedrock. The resulting increase of pore fluid pressure on preexisting fault surfaces in crystalline rocks weakens the faults and can potentially trigger earthquakes. For example, several of the largest earthquakes in the United States in 2011 and 2012 were triggered by disposal wells. The largest, a magnitude 5.6 event near Prague in central Oklahoma in 2011, destroyed 12 homes and injured 2 people. However, only a small fraction of the roughly 30,000 disposal wells appears to pose a safety risk. It is therefore important to understand the processes of injection-induced seismicity.
In this project, we will examine the question: how well can we predict earthquakes in Oklahoma (the number of earthquakes, the time of the next earthquake, the magnitude) with the features we have at our disposal?
We collect data from the Oklahoma Geological Survey on the history of earthquake events located in the state of Oklahoma during the past century. From their website, which can be accessed by clicking here, we can access an earthquake catalog which gives us the date, location and magnitude of earthquakes. We download this catalog with the following script. Because the catalog is very sparse for the years before the onset of modern seismic recording (1974), we only load the catalog for the years after 1980. It is claimed that the catalog is 'fairly complete to a minimum magnitude of 2.9 from 1980 to present'. A recent report on seismicity in the central and eastern United States found that the probability of missing M ≥ 3 earthquakes in the region has been near zero for decades (“Technical report: Central and eastern United States seismic source characterization for nuclear facilities,” Appendix B (Electric Power Research Institute, Palo Alto, CA; U.S. Department of Energy and U.S. Nuclear Regulatory Commission, Washington, DC; 2012)).
In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import json
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
from pyquery import PyQuery as pq
import requests
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import random
import json
import time
import csv
import datetime
import statsmodels.api as sm
import pickle
from matplotlib import rc
import itertools
# rc('font',**{'family':'sans-serif','sans-serif':['Helvetica']})
# rc('text', usetex=True)
In [2]:
# Default plotting
from matplotlib import rcParams
dark_colors = ["#99D699", "#B2B2B2",
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
(0.4, 0.4, 0.4)]
rcParams['figure.figsize'] = (12, 9)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = "white"
rcParams['axes.titlesize'] = 20
rcParams['axes.labelsize'] = 17.5
rcParams['xtick.labelsize'] = 15
rcParams['ytick.labelsize'] = 15
rcParams['legend.fontsize'] = 17.5
rcParams['patch.edgecolor'] = 'none'
rcParams['grid.color']="gray"
rcParams['grid.linestyle']="-"
rcParams['grid.linewidth'] = 0.3
rcParams['grid.alpha']=1
rcParams['text.color'] = "444444"
rcParams['axes.labelcolor'] = "444444"
rcParams['ytick.color'] = "444444"
rcParams['xtick.color'] = "444444"
In [3]:
# Create an initial data frame by loading the earthquake
# catalog for year 1980
eq_df = pd.DataFrame()
eq_df = pd.read_csv('http://wichita.ogs.ou.edu/eq/catalog/1980/1980.csv')
# Add the additional years until 2015 included
for year in range(1981,2016):
url = 'http://wichita.ogs.ou.edu/eq/catalog/' + str(year) + '/' + str(year) + '.csv'
temp_df = pd.read_csv(url)
eq_df = eq_df.append(temp_df,ignore_index = True)
# the dataframe is chronologically ordered
# add a column with year as a float
# e.g, 1 Jan 2015 is 2015.0 and 31 Dec 2015 is 2015.99
year_float = []
for date in eq_df.origintime.values:
new_date = time.strptime(date[0:10], "%Y-%m-%d")
to_add =int(float(new_date.tm_yday)/36.6*100)
year_float.append(float(str(new_date.tm_year) + '.' + str(to_add)))
eq_df['year_float'] = year_float
# Drop the columns we don't want to use because it takes memory
# we do this process in place to avoid copying into an other DataFrame
# drop the error in time and location for now
# drop various other estimations of magnitude
# drop information on body waves and surface waves
eq_df.drop(['err_lon','err_lat','err_depth','err_origintime','mw','mw_src','mblg_usgs','mb','mblg_ogs',
'ml_ogs', 'ms','mfa','max_mmi','reafile','reamtime','pdlid',
'mw_ogs'], axis=1, inplace=True)
# Save data frame for future use
eq_df.to_csv('./tempdata/earthquakes_catalog.csv',sep = '|')
eq_df = eq_df[eq_df.prefmag >= 3]
# Load the data frame
# eq_df = pd.DataFrame.from_csv('./tempdata/earthquakes_catalog.csv',sep = '|')
eq_df['year'] = [int(year) for year in eq_df.year_float]
# Show the first 5 earthquakes in the catalog and their features
eq_df.head()
Out[3]:
We intentionally ignore some of the data relating to body and surface waves, as well as other sources of magnitude estimates, to save memory. A complete description of the columns can be found here. The column used in this study is the recommended magnitude estimate in column prefmag. Here is a description of the columns we kept.
Note that OGS stands for Oklahoma Geological Survey and USGS for United States Geological Survey.
The water disposal well data was taken from the Oklahoma Corporation Commission, Oil and Gas Department website. Among the numerous files, we opted for a dataset that contains both the year the well was reported to the Oil and Gas Department and the location in coordinates. The downloaded dataset, which had many observations with missing features, is in wells_reportyear.csv.
In [5]:
# Read data
#welldf = pd.read_csv('tempdata/wells_reportyear.csv')
# Clean up data
#welldf = welldf[np.isfinite(welldf['YEAR'])]
# Discard extreme coordinates
#welldf = welldf[(welldf.LONGITUDE >= -102.918) & (welldf.LONGITUDE <= -94.466)]
#welldf = welldf[(welldf.LATITUDE >= 33.811) & (welldf.LATITUDE <= 36.99869)]
# Data set contains futuristic years, we discard those observations
#welldf = welldf[welldf.YEAR <= 2015]
# Save data frame for future use
# welldf.to_csv('tempdata/wells_data.csv')
# Load data for future use
welldf = pd.read_csv('tempdata/wells_data.csv')
# Show first 5 wells
print welldf.shape
welldf.head()
Out[5]:
We created an interactive visualization tool on the website that can be accessed here. The first tab maps the distribution of earthquakes in Oklahoma at a given year in red, overlaid on the previous earthquakes in green. One can use the slider in the upper right corner to get a sense of the increase in seismic activity during the past 6 years. Note the apparent spatial clustering of the recent events, which motivates the clustering analysis later in this notebook.
The disposal well data covers wells dated from 2006 to 2012 and includes the amount of water disposed underground, in gallons.
In [6]:
fig, ax = plt.subplots()
plt.bar(np.arange(2006,2013),welldf.groupby('year').count().volume.values, color = "#99D699")
ax.set_xticks(np.arange(2006,2013)+.5)
ax.set_xticklabels(np.arange(2006,2013))
ax.set_xlabel('Year')
ax.set_ylabel('Number of Wells')
ax.grid(False)
fig, ax = plt.subplots()
plt.bar(np.arange(2006,2013),welldf.groupby('year').volume.sum().values, color = "#99D699")
ax.set_xticks(np.arange(2006,2013)+.5)
ax.set_xticklabels(np.arange(2006,2013))
ax.set_xlabel('Year')
ax.set_ylabel('Water disposed in gallons')
ax.grid(False)
In the first plot above, the number of water disposal wells in Oklahoma is plotted by year. In the second plot, the total number of gallons of water disposed is plotted by year. The largest amount of water was disposed between 2006 and 2010. However, we believe that data from the past few years has not yet been reported or is not currently available for public use.
In [7]:
#REQUIRES INSTALLATION conda install basemap
#received assistance from https://peak5390.wordpress.com/2012/12/08/matplotlib-basemap-tutorial-plotting-points-on-a-simple-map/
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
def plot_earthquakes(df):
map = Basemap(llcrnrlon=int(min(df.longitude))-1, llcrnrlat=int(min(df.latitude))-1,
urcrnrlon=int(max(df.longitude))+1, urcrnrlat=int(max(df.latitude))+1)
map.plot(df.longitude.values, df.latitude.values, 'ro', markersize=2)
map.drawmapboundary(fill_color='white')
map.drawcoastlines()
map.drawstates()
map.drawcounties()
plt.plot()
plt.show()
Here we show the distribution of earthquakes in Oklahoma before 2010.
In [8]:
b42010 = eq_df[(eq_df.year_float<2010)]
plot_earthquakes(b42010)
We begin with some basic visual inspection of the data. We first plot a histogram of the number of earthquakes by year.
In [9]:
#plot number of earthquakes by year
mask = eq_df['prefmag'] >=3
plt.hist(eq_df[mask]['year_float'].values, bins=2016-1980+1)
plt.xlabel('Year')
plt.ylabel('Number of Earthquakes')
plt.grid(False)
This visual inspection suggests a clear difference in the number of earthquakes starting around 2010. Since water disposal through wells became much more intense just a few years earlier, we suspect an association between the number of earthquakes and fracking activity (specifically, based on past research, the use of disposal wells).
We further investigate this hypothesis visually. One way of expressing our hypothesis of a difference between post-2010 and pre-2010 is as follows. The cumulative number of earthquakes will, of course, increase over time, and it is a reasonable first guess to think that it would increase linearly. If there is a real difference between pre-2010 and post-2010, we would expect that fitting lines to the two groups of data separately would produce lines with different slopes.
We investigate this in the visualization below. We fit a line to the pre-2010 data and smooth the points for the post-2010 data. We continue the pre-2010 line in the background for comparison.
In [10]:
# Plot inspired from Ellsworth [Science, 2013]
mask = eq_df['prefmag'] >=3
eq_count, base = np.histogram(eq_df[mask]['year_float'], bins = eq_df[mask].shape[0])
# plot the cumulative number of earthquakes of magnitude > 3
plt.figure(figsize = (9,7))
plt.plot(base[:-1], np.cumsum(eq_count), lw=3,c='green', label='Cumulative number of Earthquakes of magnitude > 3' )
mask_time = (eq_df['year_float'] < 2009) & (eq_df['prefmag'] >=3)
eq_2010, base_2010 = np.histogram(eq_df[mask_time]['year_float'], bins = eq_df[mask_time].shape[0])
# fit the earthquake count before 2009 with a 1st order polynomial
# this is the background seismicity rate
fit = np.polyfit(base_2010[:-1],np.cumsum(eq_2010),1)
fit_fn = np.poly1d(fit)
plt.plot(base[:-1],fit_fn(base[:-1]),'k--',label = 'Background seismicity until 2009')
# now fill in between
plt.fill_between(base[:-1], np.cumsum(eq_count), fit_fn(base[:-1]), color='grey', alpha = 0.3);
plt.xlim([1980,2016]);
plt.ylabel('Cumulative number of Earthquakes')
plt.xlabel('year')
plt.legend(loc =2);
plt.grid(False)
The visualization indeed suggests that the line for the pre-2010 data does not fit the post-2010 data very well.
Since this is just a visual inspection, we test the hypothesis in part 3.
The relationship between earthquake magnitude and frequency can be modeled by the Gutenberg-Richter law:
$$\log_{10}N = a - bM$$
where $N$ is the number of events with magnitude greater than or equal to $M$, and $a$ and $b$ are constants.
Typically, in seismically active regions, the constant $b$ is close to 1. The constant $a$ carries little scientific information. This power-law relationship between event magnitude and frequency of occurrence is remarkably common, although the values of $a$ and $b$ may vary from region to region or over time. We model the seismic activity in Oklahoma before and after 2010.
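As a quick, self-contained illustration of how $b$ can be estimated, here is a minimal sketch of the Aki (1965) maximum-likelihood estimator that the calc_recurrence function below also uses; the magnitudes here are synthetic and for illustration only.
import numpy as np
# Synthetic magnitudes above a completeness threshold, drawn with a true b of 1 (illustration only)
rng = np.random.RandomState(0)
min_mag = 3.0
mags = min_mag + rng.exponential(scale=np.log10(np.e) / 1.0, size=1000)
# Aki (1965) maximum-likelihood estimator: b = log10(e) / (mean(M) - Mmin)
b_mle = np.log10(np.e) / (np.mean(mags) - min_mag)
print('MLE b value: %.2f' % b_mle)  # close to 1 for this synthetic sample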
In [11]:
# Function is taken and modified from:
# http://eqrm.googlecode.com/svn-history/r143/trunk/preprocessing/recurrence_from_catalog.py
def calc_recurrence(df,min_plot, min_mag = None, max_mag = None, interval = 0.1):
"""
This function reads an earthquake catalogue file and calculates the
Gutenberg-Richter recurrence parameters using both least squares and
maximum likelihood (Aki 1965) approaches.
Results are plotted for both straightline fits and bounded
Gutenberg-Richter curves. Also plotted is the curve that would result
assuming a b-value of 1.
Function arguments:
df: data frame containing earthquake catalog with columns prefmag for
magnitude and year_float for year
min_mag: minimum magnitude for which data will be used - i.e. catalogue
completeness
max_mag: maximum magnitude used in bounded G-R curve. If not specified,
defined as the maximum magnitude in the catalogue + 0.1 magnitude
units.
interval: Width of magnitude bins for generating cumulative histogram
of earthquake recurrence. Default value 0.1 magnitude units.
"""
magnitudes = df.prefmag.values
years = df.year_float.values
###########################################################################
# Read data
###########################################################################
# If maximum magnitude is not specified default value to maximum in catalogue
if max_mag is not None:
pass
else:
max_mag = magnitudes.max() + 0.1
if min_mag is not None:
pass
else:
min_mag = magnitudes.min()
num_eq = len(magnitudes)
print 'Minimum magnitude:', min_mag
print 'Total number of earthquakes:', num_eq
num_years = years.max()-years.min()
annual_num_eq = num_eq/num_years
print 'Annual number of earthquakes greater than Mw', min_mag,':', \
annual_num_eq
print 'Maximum catalog magnitude:', magnitudes.max()
print 'Mmax = ', max_mag
max_mag_bin = magnitudes.max() + 0.15
# Magnitude bins
bins = np.arange(min_mag, max_mag_bin, interval)
# Magnitude bins for plotting - we will re-arrange bins later
plot_bins = np.arange(min_mag, max_mag, interval)
###########################################################################
# Generate distribution
###########################################################################
# Generate histogram
hist = np.histogram(magnitudes,bins=bins)
# Reverse array order
hist = hist[0][::-1]
bins = bins[::-1]
# Calculate cumulative sum
cum_hist = hist.cumsum()
# Ensure bins have the same length as the cumulative histogram.
# Remove the upper bound for the highest interval.
bins = bins[1:]
# Get annual rate
cum_annual_rate = cum_hist/num_years
new_cum_annual_rate = []
for i in cum_annual_rate:
new_cum_annual_rate.append(i+1e-20)
# Take logarithm
log_cum_sum = np.log10(new_cum_annual_rate)
###########################################################################
# Fit a and b parameters using a variety of methods
###########################################################################
# Fit a least squares curve
b,a = np.polyfit(bins, log_cum_sum, 1)
print 'Least Squares: b value', -1. * b, 'a value', a
alpha = np.log(10) * a
beta = -1.0 * np.log(10) * b
# Maximum Likelihood Estimator fitting
# b value
b_mle = np.log10(np.exp(1)) / (np.mean(magnitudes) - min_mag)
beta_mle = np.log(10) * b_mle
print 'Maximum Likelihood: b value', b_mle
###########################################################################
# Generate data to plot results
###########################################################################
# Generate data to plot least squares linear curve
# Calculate y-intercept for least squares solution
yintercept = log_cum_sum[-1] - b * min_mag
ls_fit = b * plot_bins + yintercept
log_ls_fit = []
for value in ls_fit:
log_ls_fit.append(np.power(10,value))
# Generate data to plot bounded Gutenberg-Richter for LS solution
numer = np.exp(-1. * beta * (plot_bins - min_mag)) - \
np.exp(-1. *beta * (max_mag - min_mag))
denom = 1. - np.exp(-1. * beta * (max_mag - min_mag))
ls_bounded = annual_num_eq * (numer / denom)
# Generate data to plot maximum likelihood linear curve
mle_fit = -1.0 * b_mle * plot_bins + 1.0 * b_mle * min_mag + np.log10(annual_num_eq)
log_mle_fit = []
for value in mle_fit:
log_mle_fit.append(np.power(10,value))
# Generate data to plot bounded Gutenberg-Richter for MLE solution
numer = np.exp(-1. * beta_mle * (plot_bins - min_mag)) - \
np.exp(-1. *beta_mle * (max_mag - min_mag))
denom = 1. - np.exp(-1. * beta_mle * (max_mag - min_mag))
mle_bounded = annual_num_eq * (numer / denom)
# Compare b-value of 1
fit_data = -1.0 * plot_bins + min_mag + np.log10(annual_num_eq)
log_fit_data = []
for value in fit_data:
log_fit_data.append(np.power(10,value))
###########################################################################
# Plot the results
###########################################################################
# Plotting
fig = plt.figure(1)
ax = fig.add_subplot(111)
plt.scatter(bins, new_cum_annual_rate, label = 'Catalogue')
ax.plot(plot_bins, log_ls_fit, c = 'r', label = 'Least Squares')
ax.plot(plot_bins, ls_bounded, c = 'r', linestyle ='--', label = 'Least Squares Bounded')
ax.plot(plot_bins, log_mle_fit, c = 'g', label = 'Maximum Likelihood')
ax.plot(plot_bins, mle_bounded, c = 'g', linestyle ='--', label = 'Maximum Likelihood Bounded')
ax.plot(plot_bins, log_fit_data, c = 'b', label = 'b = 1')
#ax.plot(bins, ls_fit2, c = 'k')
ax.set_yscale('log')
ax.legend(loc = 3)
ax.set_ylim([min_plot, max(new_cum_annual_rate) * 10.])
ax.set_xlim([min_mag - 0.5, max_mag + 0.5])
ax.set_ylabel('Annual probability')
ax.set_xlabel('Magnitude')
ax.grid(False)
In [12]:
# Plot for all dataset
calc_recurrence(eq_df,1e-7)
The constant $b$ is estimated using both least squares and maximum likelihood. The estimated value of $b$ is 1.53; since the frequency falls off as $10^{-bM}$, this means earthquakes of lower magnitude are far more frequent than those of higher magnitude.
There are reasons to believe that the number of earthquakes per year, hour, or other unit of time can be approximated fairly well by a Poisson distribution, specifically that $Q \sim \text{Pois}(\lambda t)$, where $Q$ is the number of earthquakes, $t$ is some unit of time, and $\lambda$ is a rate parameter. The reasons for this are as follows.
The Binomial distribution describes the total number of "successes" out of $n$ trials with a probability of success $p$. The Poisson distribution arises as $n$ goes to infinity and $p$ goes to 0 while $np$ remains constant. Thus, it is a good approximation when there are very many trials but the probability of success is quite low for any given trial. This makes the Poisson distribution a plausible approximation for earthquakes: in a given year, hour, etc., there is a low probability that any particular small region in Oklahoma will experience a magnitude > 3 earthquake, but there are many such small regions in Oklahoma.
It can further be shown that if $Q \sim \text{Pois}(\lambda t)$, then $D \sim \text{Expo}(\lambda)$, where $D$ is the amount of time between any two successive earthquakes.
Thus, as an initial baseline, we plot the amount of time between earthquakes for the whole state of Oklahoma and see how well it can be modelled with an exponential distribution.
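Before fitting the catalog, here is a minimal simulation sketch (synthetic data only, not the catalog) of the connection stated above: events generated by a Poisson process have exponentially distributed interarrival times.
import numpy as np
import scipy.stats
rng = np.random.RandomState(0)
lam, T = 0.1, 100000.0  # rate (events per hour) and observation window (hours)
# A homogeneous Poisson process: a Pois(lam*T) number of events placed uniformly in [0, T]
n_events = rng.poisson(lam * T)
event_times = np.sort(rng.uniform(0.0, T, size=n_events))
gaps = np.diff(event_times)  # interarrival times in hours
# The gaps should be approximately Expo(lam), i.e. have mean close to 1/lam
loc, scale = scipy.stats.expon.fit(gaps, floc=0)
print('fitted mean interarrival: %.1f hours (1/lambda = %.1f)' % (scale, 1.0 / lam))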
In [13]:
import datetime
#function to create list of interarrival times in hours from df
def get_hours_between(df):
dates=[]
origintimes = df.origintime.values
for date in origintimes:
year, month, day = date.split('-')
day, hour = day.split(' ')
hour, minute, second = hour.split(':')
if len(second.split('.'))==2:
second, microsecond = second.split('.')
elif len(second.split('.'))==1:
microsecond=0
dates.append(datetime.datetime(int(year), int(month), int(day), int(hour), int(minute),
int(second), int(microsecond)))
dates=sorted(dates)
deltas=[]
for i in range(1,len(dates)):
delta = dates[i] - dates[i-1]
delta = delta.total_seconds()/3600
deltas.append(delta)
deltas = np.array(deltas)
return deltas
In [14]:
deltas = get_hours_between(eq_df[eq_df.prefmag>=3])
In [15]:
#http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.ProbPlot.html
#fit and plot exponential to data
def fit_expo(deltas, plot=True, bins=100, xlab1='Hours', ylab1='Probability Density',
xlab2='Theoretical Quantiles', ylab2='Sample Quantiles',
xmax=100, ymax=0.02, k=1., force_lambd=None, summary_stats=True):
#If we are inherting a previous model for comparison, use previous lambda
if pd.notnull(force_lambd):
#Calculate Summary Statistics
lambd=force_lambd
mean = np.mean(deltas) #sample mean
#If we are fitting the model on this dataset separately, calculate lambda from data
elif pd.isnull(force_lambd):
#Use Maxmimum Likelihood Estimator to get exponential fit
expo_fit = sp.stats.expon.fit(deltas, floc=0)
#Calculate Summary Statistics
lambd = expo_fit[1] #fitted mean
#mean = lambd
#Calculate more summary statistics
k=float(k)
n = float(len(deltas))
ssr = np.sum((deltas-lambd)**2)
aic = n*np.log(ssr/n) + 2*k
mse = np.mean((deltas-lambd)**2)
L1norm = np.mean(np.absolute(deltas-lambd))
actual_95_int = np.percentile(deltas, 2.5), np.percentile(deltas, 97.5)
pred_95_int = sp.stats.expon.interval(0.95, scale=lambd)
if plot==True:
#Overlay fitted exponential over histogram
f, axes = plt.subplots(nrows=int(1), ncols=int(2), squeeze=False)
f.set_figheight(8)
f.set_figwidth(18)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.5, hspace=0.5)
axes[0][0].hist(deltas, bins=bins, normed=True, zorder=0, color='grey',
label='Hours')
x = np.arange(0,xmax,0.01)
y = sp.stats.expon.pdf(x, scale=lambd)
axes[0][0].grid(False)
axes[0][0].plot(x,y, color='green', zorder=10, label='Exponential PDF')
axes[0][0].set_xlabel(xlab1)
axes[0][0].set_ylabel(ylab1)
axes[0][0].set_xlim(0,xmax)
axes[0][0].set_ylim(0,ymax)
axes[0][0].legend()
axes[0][0].set_title('Hours Between Quakes')
#QQ Plot for fit to exponential
ppstats = sm.ProbPlot(deltas, sp.stats.expon, loc=0, scale=lambd)
#Calculate one more summary statistic
t_quants = ppstats.theoretical_quantiles #x
s_quants = ppstats.sample_quantiles #y
y_fit = t_quants
qq_ssr = np.sum((s_quants-y_fit)**2)
qq_sst = np.sum((s_quants-np.mean(s_quants))**2)
qq_r2 = 1-(qq_ssr/qq_sst)
#Another plot
#QQ's
axes[0][1].scatter(t_quants,s_quants, marker='o', color='green')
#45 degree reference line
x = np.arange(0,20000, 1.)
y = x
axes[0][1].plot(x,y, color='gray')
axes[0][1].grid(False)
axes[0][1].set_xlabel(xlab2)
axes[0][1].set_ylabel(ylab2)
axes[0][1].set_xlim(0,20000)
axes[0][1].set_ylim(0,20000)
axes[0][1].tick_params(axis='x', which='both', bottom='off', top='off', labelbottom='off')
axes[0][1].set_title('QQ Plot')
#Print Summary Statistics
if summary_stats==True:
print "Summary Statistics for Distributional Fit"
print "lambda = " + str(lambd)
if pd.notnull(force_lambd):
print "sample mean = " + str(mean)
print "sample size = " + str(n)
print "AIC = " + str(aic)
print "QQ R^2 = " + str(qq_r2)
print "Actual 95% Confidence Interval = " + str(actual_95_int)
print "Predicted 95% Confidence Interval = " + str(pred_95_int)
print "Summary Statistics for Prediction"
print "MSE = " + str(mse)
print "L1 Norm = " + str(L1norm)
return(lambd, n, aic, k, mse, qq_r2, t_quants, s_quants, L1norm, actual_95_int, pred_95_int)
In [16]:
deltas1 = get_hours_between(eq_df[eq_df.prefmag>=3])
summary1 = fit_expo(deltas1, plot=True, bins=500,
xmax=500, ymax=0.024, k=1., force_lambd=None, summary_stats=True)
deltas1
Out[16]:
The above plot shows that the fit is not very good. The exponential curve, like the histogram, is monotonically decreasing, but overall the curve does not match the data well. Furthermore, the QQ plot shows that the data is not well approximated by an exponential distribution.
In [17]:
#get the index of the 5th percentile
int((float(len(deltas1))/100)*5)
#get the index of the 95th percentile
int((float(len(deltas1))/100)*95)
Out[17]:
In [18]:
# Separate the data by time period
pre2010 = eq_df[eq_df.year_float <= 2010]
post2010 = eq_df[eq_df.year_float > 2010]
The first test that we perform is a two-sample t-test that compares the number of earthquakes per year. Our null hypothesis is that the number of events per year before 2010 is equal to that after 2010. The alternative is that the number of earthquakes per year after 2010 differs from that before 2010.
We perform a two-sample Welch's t-test on the log of the number of earthquakes since our data is severely skewed to the right:
In [19]:
eq_df_priorByYear = pre2010.groupby('year').count().reset_index().prefmag
eq_df_postByYear = post2010.groupby('year').count().reset_index().prefmag
# Plot samples to see distribution
fig, ax = plt.subplots()
data = [np.log(eq_df_postByYear), np.log(eq_df_priorByYear)]
plt.boxplot(data,vert=0, )
plt.grid(False)
ax.set_aspect(1.5)
ax.set_xlim([-.5, 7])
ax.set_xlabel('Log Number of earthquakes per year in tens')
ax.set_yticks([1,2])
ax.set_yticklabels(['After 2010','Before 2010'])
Out[19]:
In [20]:
# Perform t-test (Welch’s t-test)
# Null hypothesis: number of earthquakes per year are the same in both time periods
# Alternative hypothesis: number of earthquakes per year are NOT the same between time periods
t_test = sp.stats.ttest_ind(np.log(eq_df_priorByYear), np.log(eq_df_postByYear), equal_var=False)
t_test
Out[20]:
With a p-value of .0003, we reject the null hypothesis and conclude that the number of earthquakes per year differs between the two periods. In addition, we perform a permutation-style resampling test (sampling with replacement) in case the log-transformed data does not follow a normal distribution:
In [21]:
# Prepare data
alleq_df_ByYear = eq_df.groupby('year').count().reset_index().prefmag
n = len(eq_df_postByYear)
N = len(eq_df.groupby('year').count().reset_index().prefmag)
# Get observed difference
obsdiff = np.log(eq_df_postByYear.mean())-np.log(eq_df_priorByYear.mean())
# Start list vector for storing difference of means
mediff = []
# Run permutations
for i in range(100000):
post = np.random.choice(alleq_df_ByYear.values, size = n, replace = True)
prior = np.random.choice(alleq_df_ByYear.values, size = N-n, replace = True)
mediff.append(np.log(post.mean())-np.log(prior.mean()))
# Plot results
plt.hist(mediff, bins = 50)
plt.axvline(obsdiff, color='b', linestyle='dashed', linewidth=2, label = 'Observed difference')
plt.xlabel('Difference of earthquakes per year between prior and post 2010')
plt.ylabel('Frequency')
plt.legend()
plt.grid(False)
# Get p.value
count = 0
for diff in mediff:
if diff > obsdiff:
count += 1
print count/100000.
With a p-value of 0.00015, we reject the null hypothesis and conclude the same as before.
We apply the Gutenberg-Richter law again to our data, but this time with the data separated into events prior to 2010 and after 2010. The goal is to compare the frequency-magnitude relationship between these two time periods.
In [22]:
calc_recurrence(pre2010,1e-10)
In [23]:
calc_recurrence(post2010,1e-5)
The $b$ value after 2010 is smaller than that prior to 2010, which in the Gutenberg-Richter law corresponds to a larger proportion of high-magnitude earthquakes relative to low-magnitude ones after 2010 than usual. Moreover, after 2010 there are 295 earthquakes of magnitude greater than 3 per year, compared with only 2 per year prior to 2010.
When we separate pre- and post-2010 data, we find that the pre-2010 data is fairly well described by a Poisson process. The first plot below shows the lack of fit to the pre-2010 data of the model that was fitted to the whole dataset. The second plot shows that a model fitted to the pre-2010 data alone fits much better. We penalize the increase in the number of parameters for the split model by using Akaike's Information Criterion; a lower AIC suggests a better model. The split model has two parameters, versus one parameter for the unsplit model.
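For reference, the AIC computed inside fit_expo above is the least-squares form, with $n$ the sample size, $\mathrm{SSR}$ the sum of squared deviations of the interarrival times from the fitted mean, and $k$ the number of parameters ($k=1$ without the split, $k=2$ with it):
$$\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSR}}{n}\right) + 2k$$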
In [24]:
deltas2 = get_hours_between(eq_df[(eq_df.prefmag>=3) & (eq_df.year_float<2010)])
summary2 = fit_expo(deltas2, plot=True, bins=100, xmax=5000, ymax=0.00085, k=1., force_lambd=summary1[0],
summary_stats=True, xlab1='Hours', ylab1='Probability Density', xlab2='Theoretical Quantiles')
In [25]:
summary3 = fit_expo(deltas2, plot=True, bins=30, xmax=10000, ymax=0.0004, k=2., force_lambd=None, summary_stats=True)
In [26]:
deltas3 = get_hours_between(eq_df[(eq_df.prefmag>=3) & (eq_df.year_float>=2010)])
summary4 = fit_expo(deltas3, plot=True, bins=100, xmax=5000, ymax=0.00085, k=1., force_lambd=summary1[0],
summary_stats=True, xlab1='Hours', ylab1='Probability Density', xlab2='Theoretical Quantiles')
However, the above plot shows that while the split model does better on the pre-2010 data, the post-2010 data is still not well fit by a Poisson process.
In this section we compare various regression models that try to predict the number of earthquakes from the number of disposal wells and the volume of water injected into them. First we filter the earthquake catalog and keep only the events of magnitude greater than 3 (the ones felt by people). We also initialize a dictionary, results_dic, in which we will store the results of all the regressions.
In [27]:
# ------------------------------------------
# THE REGRESSION SECTION HAS SOME CODES THAT TAKE MORE THAN 40-45 MINUTES TO RUN,
# EVEN WITH THE MULTITHREADING TECHNIQUE USED
# YOU CAN RUN THE WHOLE SECTION BY RUNNING THE FOLLOWING COMMAND:
# run grid_regression_sklearn.py
# however this won't plot the results from the clustering algorithm
# (see details later)
# ------------------------------------------
In [28]:
# ------------------------------------------
# LOAD DATAFRAMES
# ------------------------------------------
# Load the earthquakes dataframe
eq_df = pd.DataFrame.from_csv('./tempdata/earthquakes_catalog_treated.csv',sep = '|')
# filter to keep only magnitude >= 3
eq_df = eq_df[eq_df.prefmag >= 3.0]
# for ease add column year
eq_df['year'] = map(lambda x: int(x), eq_df['year_float'])
# Load the wells dataframe.
welldf = pd.DataFrame.from_csv('tempdata/wells_data.csv')
# define the number of threads for the computation
# please optimize according to your number of cores
num_threads = 4
# initialize the dictionary in which we store the results of all the regressions
results_dic= {}
In [29]:
import pickle
import itertools as it
# library for multithreading
import threading
# library for print messages from each thread
import logging
logging.basicConfig(level=logging.DEBUG,
format='[%(levelname)s] (%(threadName)-10s) %(message)s',
)
# for time split
import datetime
# Load package for linear model
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
# import library for calculating the distance
# if this line does not work then install the library
# using: pip install geopy
from geopy.distance import great_circle
Because we will also run a regression to predict the time between earthquakes, we introduce here a function that evaluates the interarrival time, in hours, between successive earthquakes in a dataframe df.
In [30]:
def get_hours_between(df):
dates=[]
origintimes = df.origintime.values
for date in origintimes:
year, month, day = date.split('-')
day, hour = day.split(' ')
hour, minute, second = hour.split(':')
if len(second.split('.'))==2:
second, microsecond = second.split('.')
elif len(second.split('.'))==1:
microsecond=0
dates.append(datetime.datetime(int(year), int(month), int(day), int(hour), int(minute),
int(second), int(microsecond)))
dates=sorted(dates)
intertimes =[]
for i in range(1,len(dates)):
delta = dates[i] - dates[i-1]
delta = delta.total_seconds()/3600
intertimes.append(delta)
return intertimes
Here is a function that creates a mask for a region defined by bounds in latitude and longitude. The regions themselves are output by the partition_state function, which partitions the state of Oklahoma into square cells of a given side length.
In [31]:
def mask_region(df, region):
mask_region = (df['latitude'] < region[0][1]) \
& (df['latitude'] >= region[0][0]) \
& (df['longitude'] < region[1][1]) \
& (df['longitude'] >= region[1][0])
return mask_region
Here is the function that partitions the state of Oklahoma into cells. We provide the function with an interval used to split the range between the minimum and maximum latitude and longitude. The output of the function is a dictionary whose keys are the cell bounds, ((minimum latitude, maximum latitude), (minimum longitude, maximum longitude)), and whose values are the cell numbers.
In [32]:
def partition_state(interval):
# ------------------------------------------
# PARTITION THE MAP INTO CELLS = CREATE THE GRID
# ------------------------------------------
# Make ranges
# Since all earthquakes are in Oklahoma we partition roughly
# using the upper bound of the state limit
xregions1 = np.arange(33.5, 37.0, interval)
xregions2 = np.arange(33.5 + interval, 37.0 + interval, interval)
xregions = zip(xregions1, xregions2)
yregions1 = np.arange(-103.0,-94. , interval)
yregions2 = np.arange(-103.0 + interval ,-94.0 + interval, interval)
yregions = zip(yregions1, yregions2)
# Create a dictionary with keys = (slice in long, slice in latitude)
# value = number of the grid cell
regions = it.product(xregions,yregions)
locdict = dict(zip(regions, range(len(xregions)*len(yregions))))
return locdict
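As an illustration of how these two helpers combine (a minimal sketch, assuming eq_df has been loaded and filtered to magnitude >= 3 as above), the number of earthquakes in each cell of a 1-degree grid can be counted as follows:
# Usage sketch: count earthquakes per cell on a 1-degree grid
locdict = partition_state(1.0)
counts = {}
for region, cell_id in locdict.items():
    counts[cell_id] = eq_df[mask_region(eq_df, region)].shape[0]
print('non-empty cells: %d out of %d' % (sum(1 for c in counts.values() if c > 0), len(counts)))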
Once the predictor and response vectors are assembled (see details in the next subsection), we split them into a training and a test set. The size of the test set is set to 1/3 of the total sample size. This is done using the train_test_split function from scikit-learn. We then try various standardizations of the data. The first standardization tested centers the data by removing the mean value of each feature, then scales it by dividing non-constant features by their standard deviation. This is done using the preprocessing.StandardScaler() method. We also tested the preprocessing.MinMaxScaler() method, which simply rescales the data to the range 0 to 1. The third option leaves the data untransformed. Finally, we implement a ridge regression with cross-validation of the alpha parameter (the parameter used for the L2 regularization). Since we only have two features, we do not bother doing a Lasso regression.
In [33]:
def do_regression(X,Y,reg,locdict,lock,cv, standardization):
# --------------------
# SPLIT INTO TRAIN AND TEST
# --------------------
# Split in train - test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
# --------------------
# STANDARDIZATION OF THE DATA -- SCALING
# --------------------
if standardization == 'scaler':
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
elif standardization == 'MinMaxScaler':
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
y_train = min_max_scaler.fit_transform(y_train)
y_test = min_max_scaler.transform(y_test)
else:
pass
# --------------------
# OPTIMIZE CLASSIFIER WITH RIDGE REGRESSION
# AND ORDINARY LEAST SQUARE REGRESSION
# --------------------
# # Using Ordinary Least Square Regression
# clf = linear_model.LinearRegression()
# clf.fit(X_train, y_train)
# logging.debug('For {} cells the score is {}'.format(len(locdict.keys()),clf.score(X_test, y_test)))
# # Using Ridge Regression and cross-validation
# # doing the selection manually
# # uncomment this part to check it matches the next paragraph
# clf = linear_model.Ridge()
# parameters = {'alpha': [0.1, 0.5]}
# gs = GridSearchCV(clf, param_grid=parameters, cv=5)
# gs.fit(X_train, y_train)
# best = gs.best_estimator_
# best.fit(X_train, y_train)
# logging.debug('For {} cells the score of manual Ridge is {}'.format(len(locdict.keys()),best.score(X_test, y_test)))
# Using Ridge Regression with built-in cross validation
# of the alpha parameters
# note that alpha = 0 corresponds to the Ordinary Least Square Regression
clf = linear_model.RidgeCV(alphas=[0.0, 0.1, 1, 10.0, 100.0, 1e3,1e4 ,1e5], cv = cv)
clf.fit(X_train, y_train)
return clf, X_test, y_test
In this section we present the various types of regressions done in this study. We first write a function, split_in_batch, that will be used to parallelize our code. Since the loop over all the grid sizes is embarrassingly parallel, we distribute the grid sizes across different workers.
In [34]:
def split_in_batch(intervals):
# randomly split the intervals into batches for the various threads
all_batch = []
size_batch = len(intervals) / num_threads
for i in range(num_threads-1):
batch_per_threads = random.sample(intervals,size_batch)
all_batch.append(batch_per_threads)
# new set
intervals = list(set(intervals) - set(batch_per_threads))
# now get the rest
all_batch.append(intervals)
print('look at all_batch {}'.format(all_batch))
return all_batch
The function print_best_score_append_dictionary prints the best results of the trained classifiers. We also append these results to the results_dic dictionary.
In [35]:
def print_best_score_append_dictionary(reg_type):
best_score_prior = [[c[1],c[2]] for c in best_grid_prior]
best_score_prior = np.array(best_score_prior)
best_index_prior = np.where(best_score_prior[:,0] == max(best_score_prior[:,0]))[0][0]
print('Best classifier <2010 is for alpha ={}'.format(best_grid_prior[best_index_prior][0].alpha_))
print('Coefs <2010 are ={}'.format(best_grid_prior[best_index_prior][0].coef_))
print('R^2 <2010 = {}'.format(best_grid_prior[best_index_prior][1]))
prior_dic = {'coefs':best_grid_prior[best_index_prior][0].coef_ , 'r2': best_grid_prior[best_index_prior][1]}
if reg_type == 'grid':
prior_dic['interval'] = best_grid_prior[best_index_prior][2]
print('Best interval <2010 is {}'.format(best_grid_prior[best_index_prior][2]))
elif reg_type == 'cluster':
prior_dic['eps'] = best_grid_prior[best_index_prior][2]
print('Best eps <2010 is {}'.format(best_grid_prior[best_index_prior][2]))
best_score_post = [[c[1],c[2]] for c in best_grid_post]
best_score_post = np.array(best_score_post)
best_index_post = np.where(best_score_post[:,0] == max(best_score_post[:,0]))[0][0]
print('Best classifier >= 2010 is for alpha ={}'.format(best_grid_post[best_index_post][0].alpha_))
print('Coefs >= 2010 are ={}'.format(best_grid_post[best_index_post][0].coef_))
print('R^2 >= 2010 = {}'.format(best_grid_post[best_index_post][1]))
post_dic = {'coefs':best_grid_post[best_index_post][0].coef_ , 'r2': best_grid_post[best_index_post][1]}
if reg_type == 'grid':
post_dic['interval'] = best_grid_post[best_index_post][2]
print('Best interval >= 2010 is {}'.format(best_grid_post[best_index_post][2]))
elif reg_type == 'cluster':
post_dic['eps'] = best_grid_post[best_index_post][2]
print('Best eps >= 2010 is {}'.format(best_grid_post[best_index_post][2]))
return prior_dic, post_dic
We use two methods here. The first one creates square cells over the state of Oklahoma. In each cell we evaluate the number of earthquakes (fed into the response vector), as well as the number of disposal wells and the total volume of water injected (these two features are appended to the predictor vector). Doing this for all the cells, we perform the regression: $$ \text{number of earthquakes} \propto \text{number of wells} + \text{total volume injected} $$
The function do_grid_regression assembles the predictor and response vectors, $X$ and $Y$, that are fed to do_regression. We do the regression separately for the data prior to 2010 and post 2010. The score of the classifier ($R^2$, the coefficient of determination) is appended to best_grid_prior and best_grid_post for the data prior to 2010 and post 2010, respectively.
In [36]:
def do_grid_regression(eq_df, welldf, intervals, lock ,cv = 5, standardization = None):
global best_grid_prior
global best_grid_post
for interval in intervals:
# Get dictionary for the partitioned state
locdict = partition_state(interval)
# Filter by time
eq_df_prior = eq_df[eq_df.year < 2010]
welldf_prior = welldf[welldf.year < 2010]
eq_df_post = eq_df[eq_df.year >= 2010]
welldf_post = welldf[welldf.year >= 2010]
X_prior = []
X_post = []
Y_prior = []
Y_post = []
### Start grid size loop here
for region in locdict.keys():
# generate dataframe for regression with data < 2010
# add the number of quakes per region
Y_prior.append(eq_df_prior[mask_region(eq_df_prior,region)].count().values[0])
# add the number of wells per region
# add the total volume injected per region
# add them with into X_prior as [nb wells, volume]
X_prior.append([welldf_prior[mask_region(welldf_prior,region)].count().values[0]
, welldf_prior[mask_region(welldf_prior,region)].volume.sum()])
# generate dataframe for regression with data >= 2010
# add the number of quakes per region
Y_post.append(eq_df_post[mask_region(eq_df_post,region)].count().values[0])
# add the number of wells per region
# add the total volume injected per region
# add them with into X_post as [nb wells, volume]
X_post.append([welldf_post[mask_region(welldf_post,region)].count().values[0]
, welldf_post[mask_region(welldf_post,region)].volume.sum()])
X_prior = np.array(X_prior,dtype=np.float64)
X_post = np.array(X_post,dtype=np.float64)
Y_post = np.array(Y_post, dtype=np.float64).reshape(-1,1)
Y_prior = np.array(Y_prior, dtype = np.float64).reshape(-1,1)
# ------------------------------------------
# DOING THE REGRESSION
# ------------------------------------------
# logging.debug(' For {} cells, Total number of quakes: prior {}, post {}'\
# .format(len(locdict.keys()),sum(X_prior[:,0]), sum(X_post[:,0]) ))
reg_for = ['prior', 'post']
for reg in reg_for:
if reg == 'prior':
X = X_prior
Y = Y_prior
elif reg == 'post':
X = X_post
Y = Y_post
clf, X_test, y_test = do_regression(X,Y,reg,locdict,lock, cv ,standardization)
logging.debug('{}: For {} cells the score of RidgeCV is {} with alpha = {}'\
.format(reg,len(locdict.keys()),clf.score(X_test, y_test),clf.alpha_))
with lock:
if reg == 'prior':
best_grid_prior.append([clf,clf.score(X_test, y_test),interval])
elif reg == 'post':
best_grid_post.append([clf,clf.score(X_test, y_test),interval])
return
Here we run the ridge regression on a grid. We loop over a range of grid sizes: the state of Oklahoma is split into 10 to 3100 cells. In each case we do the ridge regression and find the best classifier using the coefficient of determination as a metric. Generating and training the classifier on the various grids is an embarrassingly parallel process, so we use Python's threading module to do this. Each thread has a few grids to generate and regress on. We use a lock to append the results to the global lists best_grid_prior and best_grid_post. This ensures that the threads do not write to these two shared lists at the same time.
In [ ]:
# ------------------------------------------
# GRID REGRESSION
# RUNNING THIS CELL TAKES A WHILE ~ 15 MIN
# ------------------------------------------
best_grid_prior = []
best_grid_post = []
# define the intervals
intervals = [0.05, 0.1,0.2, 0.3, 0.4,0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
# intervals = [0.4, 0.5, 0.8,0.9, 1.0, 1.5]
# randomly split the intervals into batches for the various threads
all_batch = split_in_batch(intervals)
# Vary the standardization and find optimum
for standardization in ['None','scaler','MinMaxScaler'] :
# parallelize the loop of interval
threads = []
lock = threading.Lock()
for thread_id in range(num_threads):
interval = all_batch[thread_id]
t = threading.Thread(target = do_grid_regression, \
args = (eq_df, welldf, interval, lock ,5,standardization))
threads.append(t)
map(lambda t:t.start(), threads)
map(lambda t: t.join(), threads)
prior_dic, post_dic = print_best_score_append_dictionary('grid')
grid_regression_dic = {'prior': prior_dic, 'post': post_dic }
results_dic['grid_regression'] = grid_regression_dic
print results_dic
Here we use the same grid method but regress on the interarrival time between events. For each cell, we extract the times between the events in it and append them to the response vector. Because some cells contain no events (the interarrival time is effectively $\infty$), we choose to regress on 1/interarrival instead, which sets the value to 0 in those cells. We also append the number of wells and the total volume injected in the cell to the predictor vector. The regression is $$ 1/\text{interarrival time} \propto \text{number of wells} + \text{total volume injected} $$
The function do_grid_interarrival_regression assembles the predictor and response vectors, $X$ and $Y$, that are fed to the do_regression function. We do the regression separately for the data prior to 2010 and post 2010. The score of the classifier ($R^2$, the coefficient of determination) is appended to best_grid_prior and best_grid_post for the data prior to 2010 and post 2010, respectively.
In [37]:
def do_grid_interarrival_regression(eq_df, welldf, intervals, lock ,cv = 5, standardization = None):
global best_grid_prior
global best_grid_post
for interval in intervals:
# Get dictionary for the partitioned state
locdict = partition_state(interval)
# Filter by time
eq_df_prior = eq_df[eq_df.year < 2010]
welldf_prior = welldf[welldf.year < 2010]
welldf_prior.reset_index(inplace=True)
welldf_post = welldf[welldf.year >= 2010]
welldf_post.reset_index(inplace=True)
eq_df_post = eq_df[eq_df.year >= 2010]
X_prior = []
X_post = []
Y_prior = []
Y_post = []
for region in locdict.keys():
# DO THIS FOR PRIOR
# get mask for the cluster_id
mask = mask_region(eq_df_prior,region)
Y_prior_append = get_hours_between( eq_df_prior[ mask] )
if len(Y_prior_append) != 0:
Y_prior_append = [1.0/y for y in Y_prior_append]
else:
Y_prior_append = [0.0]
for y in Y_prior_append:
Y_prior.append(y)
# add the number of wells per region
# add the total volume injected per region
# add them with into X_prior as [nb wells, volume]
X_prior_append = [welldf_prior[mask_region(welldf_prior,region)].count().values[0]
, welldf_prior[mask_region(welldf_prior,region)].volume.sum()]
for i in range(len(Y_prior_append)):
X_prior.append(X_prior_append)
# DO THIS FOR POST
# get mask for the cluster_id
mask = mask_region(eq_df_post,region)
Y_post_append = get_hours_between( eq_df_post[ mask] )
# logging.debug('Y_post {}'.format(Y_post))
if len(Y_post_append) != 0:
Y_post_append = [1.0/y for y in Y_post_append if y!=0]
else:
Y_post_append = [0.0]
for y in Y_post_append:
Y_post.append(y)
# add the number of wells per region
# add the total volume injected per region
# add them with into X_post as [nb wells, volume]
X_post_append = [welldf_post[mask_region(welldf_post,region)].count().values[0]
, welldf_post[mask_region(welldf_post,region)].volume.sum()]
for i in range(len(Y_post_append)):
X_post.append(X_post_append)
X_prior = np.array(X_prior,dtype=np.float64)
X_post = np.array(X_post,dtype=np.float64)
Y_post = np.array(Y_post, dtype=np.float64).reshape(-1,1)
Y_prior = np.array(Y_prior, dtype = np.float64).reshape(-1,1)
# ------------------------------------------
# DOING THE REGRESSION
# ------------------------------------------
# logging.debug(' For {} cells, Total number of quakes: prior {}, post {}'\
# .format(len(locdict.keys()),sum(X_prior[:,0]), sum(X_post[:,0]) ))
reg_for = ['prior', 'post']
for reg in reg_for:
if reg == 'prior':
X = X_prior
Y = Y_prior
elif reg == 'post':
X = X_post
Y = Y_post
clf, X_test, y_test = do_regression(X,Y,reg,locdict,lock, cv ,standardization)
logging.debug('{}: For {} cells the score of RidgeCV is {} with alpha = {}'\
.format(reg,len(locdict.keys()),clf.score(X_test, y_test),clf.alpha_))
with lock:
if reg == 'prior':
best_grid_prior.append([clf,clf.score(X_test, y_test), interval])
elif reg == 'post':
best_grid_post.append([clf,clf.score(X_test, y_test),interval])
return
Here we run the regression on 1/interarrival times using the various grids generated. Once more we take advantage of multithreading to speed up the computation significantly. The best classifier, according to the coefficient of determination, is again added to the results_dic dictionary.
In [ ]:
# ------------------------------------------
# GRID 1/INTERARRIVAL REGRESSION
# THIS CELL TAKES ABOUT 10-20 MIN TO RUN
# ------------------------------------------
best_grid_prior = []
best_grid_post = []
# define the intervals
# intervals = [0.05, 0.1,0.2, 0.3, 0.4,0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
intervals = [0.8,0.9, 1.0, 1.5]
# randomly split the intervals into batches for the various threads
all_batch = split_in_batch(intervals)
# Vary the standardization and find optimum
for standardization in ['None','scaler','MinMaxScaler'] :
# parallelize the loop of interval
threads = []
lock = threading.Lock()
for thread_id in range(num_threads):
interval = all_batch[thread_id]
t = threading.Thread(target = do_grid_interarrival_regression, \
args = (eq_df, welldf, interval, lock ,5,standardization))
threads.append(t)
map(lambda t:t.start(), threads)
map(lambda t: t.join(), threads)
prior_dic, post_dic = print_best_score_append_dictionary('grid')
regression_grid_interarrival = {'prior': prior_dic, 'post': post_dic }
results_dic['regression_grid_interarrival'] = regression_grid_interarrival
print results_dic
In this section we apply the DBSCAN algorithm to look at the clustering of earthquakes of magnitude greater than 3 (often felt by people). We write 3 functions here. The function compute_dbscan runs the DBSCAN clustering algorithm, which identifies clusters in the event catalog we pass. The mask allows us to select only the events of magnitude greater than 3. The period is either 'prior' for before 2010 or 'post' for after 2010. eps and min_samples are the two main parameters of the algorithm: eps is the maximum distance between two samples for them to be considered as in the same neighborhood, and min_samples is the number of events in a neighborhood for a point to be considered a core point (this count includes the point itself). The algorithm is implemented using scikit-learn.
In [38]:
def compute_dbscan(eq_df,mask,period, eps, min_samples):
# Create a subset of the dataframe to work with
subset = eq_df[mask]
# ------------------------------------------------------------------------
# COMPUTE DBSCAN
# ------------------------------------------------------------------------
X = []
for (lat,longi) in zip(subset.latitude,subset.longitude):
X.append((lat,longi))
# implement the sparse matrix of distance
X_dist = np.zeros((len(X),len(X)), dtype = np.float64)
# select on index
for i1 in range(len(X)):
# loop over all the other indices
for i2 in range(len(X)):
# now find the distance
if i1 < i2:
# if i1 = i2 , distance = 0
X_dist[i1,i2] = great_circle(X[i1],X[i2]).km
# fill the symetric part of the matrix
# since distance(x1, x2) = distance(x2, x1)
X_dist[i2,i1] = X_dist[i1,i2]
db = DBSCAN(eps= eps , min_samples= min_samples , metric = 'precomputed').fit(X_dist)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
# note that noise is when the label is -1
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# print('Estimated number of clusters: %d' % n_clusters_)
return db, n_clusters_
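As a side note, scikit-learn's DBSCAN also accepts the built-in 'haversine' metric on coordinates given in radians, which avoids building the dense precomputed distance matrix used above. The following is an alternative sketch under that assumption, not the method used in the rest of this notebook.
import numpy as np
from sklearn.cluster import DBSCAN
def compute_dbscan_haversine(subset, eps_km, min_samples):
    # Alternative sketch: cluster on great-circle distance without a precomputed matrix
    coords = np.radians(subset[['latitude', 'longitude']].values)  # haversine expects radians
    earth_radius_km = 6371.0
    db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=min_samples,
                metric='haversine', algorithm='ball_tree').fit(coords)
    n_clusters_ = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
    return db, n_clusters_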
Once the algorithm has run, we store the labels of the events in the earthquake catalog. Each label corresponds to the cluster to which the event belongs. If the label is -10, the event was not considered by the algorithm because the mask we passed filters it out. If the label is -1, the event is classified as noise: it does not belong to any of the clusters, either because it is too far from all the other events or because it belongs to a group of events with fewer earthquakes than min_samples. Otherwise the labels range from 0 to the maximum number of clusters minus one. The added columns are called, for example, 'cluster_prior_eps_8' for the events before 2010 with eps = 8, and 'cluster_post_eps_15' for the events after 2010 with eps = 15.
In [42]:
def add_column_dataframe(eq_df,mask,eps,period, db):
# ------------------------------------------------------------------------
# APPEND COLUMN IN DATAFRAME
# ------------------------------------------------------------------------
col_name = 'cluster_' + period + '_eps_' + str(eps)
eq_df[col_name] = np.zeros((eq_df.shape[0]))
c = 0
for i in range(eq_df.shape[0]):
if mask[i]:
# put the label in the corresponding column
eq_df.loc[i,col_name] = db.labels_[c]
c += 1
else:
eq_df.loc[i,col_name] = - 10
return eq_df
The function plot_cluster_M plots the events on a map of Oklahoma and colors the various clusters. The events in black were labeled as noise by the clustering algorithm. The argument distance_to_search_for_neighboring_points corresponds to eps but has been renamed for the interactive plot detailed in the next cell.
In [43]:
def plot_cluster_M(Period, distance_to_search_for_neighboring_points):
eps = distance_to_search_for_neighboring_points
if Period == 'before 2010':
period = 'prior'
elif Period == 'after 2010':
period = 'post'
col_name = 'cluster_' + period + '_eps_' + str(eps)
various_labels = list(set(eq_df[col_name].values) - set([-10]))
# now plot
fig = plt.figure(figsize= (11,9))
ax = fig.add_subplot(111)
# ---
# PLOT THE CLUSTERS
# ---
n_clusters_ = len(list(various_labels)) - 1
for i in range(0,len(list(various_labels))):
if list(various_labels)[i]!=-1:
ax.scatter(eq_df.longitude[eq_df[col_name] == list(various_labels)[i]], \
eq_df.latitude[eq_df[col_name] == list(various_labels)[i]],label=list(various_labels)[i],
color = sns.color_palette('hls',n_clusters_)[i], alpha =1, lw = 4);
else:
ax.scatter(eq_df.longitude[eq_df[col_name] == list(various_labels)[i]],\
eq_df.latitude[eq_df[col_name] == list(various_labels)[i]],label='noise',
color = 'k', alpha =0.1, lw=3);
csfont = {'fontsize':25}
ax.set_ylabel('Latitude',**csfont);
ax.set_xlabel('Longitude',**csfont);
ax.axis(**csfont)
ax.set_title("Number of clusters: {}".format(n_clusters_),**csfont);
# ---
# PLOT SOME CITIES
# ---
# add a couple of cities
cities = pd.read_csv('./tempdata/OK_cities.txt', sep = '\t')
# add cities
mask_towns = (cities.Name == 'Oklahoma City') | (cities.Name == 'Prague') | (cities.Name == 'Tulsa') \
| (cities.Name == 'Ardmore') | (cities.Name == 'Stillwater') \
| (cities.Name == 'Enid') | (cities.Name == 'Jones') \
| (cities.Name == 'Cherookee') | (cities.Name == 'Perry')
cities_list = ['Oklahoma City','Prague','Tulsa', \
'Ardmore','Stillwater', 'Enid', 'Jones', 'Cherookee', 'Perry']
for city in cities_list:
ax.annotate(city,\
xy=(cities[cities.Name == city].Longitude.values[0] ,\
cities[cities.Name == city].Latitude.values[0]),\
size=20)
ax.scatter(cities[mask_towns].Longitude.values[:],
cities[mask_towns].Latitude.values[:], marker = '*', s= 100)
ax.set_ylim([34,37])
ax.set_xlim([-99.2,-96])
ax.tick_params(labelsize=20)
plt.show()
return
This cell runs the DBSCAN clustering algorithm for various values of eps. It takes roughly 10 to 15 minutes to run. Instead of running it, you can simply load the dataframe to which the columns produced by the clustering algorithm have already been appended.
In [ ]:
# THIS CELL TAKES ABOUT 15 MIN TO RUN
# import libraries for DBSCAN
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
# import library for distance calculation
from geopy.distance import great_circle
# ------------------------------------------------------------------------
# LOAD THE DATA AND KEEP EVENTS WITH MAGNITUDE >= 3.5
# ------------------------------------------------------------------------
eq_df = pd.DataFrame.from_csv('./tempdata/earthquakes_catalog.csv',sep = '|')
mask = (eq_df.prefmag >= 3.5)
# eps = the distance (in km) to search for neighboring points
# min_samples = the minimum number of points required to form a cluster
for eps in range(5,30):
# compute_dbscan(eq_df,mask,'prior', eps = eps, min_samples = 20)
db, n_clusters_ = compute_dbscan(eq_df,mask,'prior', eps = eps , min_samples = 20)
eq_df = add_column_dataframe(eq_df,mask,eps, 'prior', db)
db, n_clusters_ = compute_dbscan(eq_df,mask,'post', eps = eps , min_samples = 20)
eq_df = add_column_dataframe(eq_df,mask,eps, 'post', db)
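If you do run the loop above, you will probably want to persist the augmented catalog so that it matches the file loaded in the next cell; a minimal sketch, mirroring the path and separator used below:
In [ ]:
# save the catalog with the appended cluster columns for later reuse
eq_df.to_csv('./tempdata/earthquakes_catalog_treated.csv', sep = '|')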
This cell loads the results from the algorithm and creates an interactive plot in which you can visualize the clusters for various values of eps, before and after 2010.
In [44]:
eq_df = pd.DataFrame.from_csv('./tempdata/earthquakes_catalog_treated.csv',sep = '|')
from IPython.html.widgets import interact
global eq_df
interact(plot_cluster_M, Period = ('before 2010', 'after 2010') ,distance_to_search_for_neighboring_points=(5,29,1));
Above we used the DBSCAN algorithm to define different clusters in the catalog. Here is a function that finds the center of mass of a catalog of events: we pass the large dataframe with all earthquakes together with a mask that selects a particular cluster of earthquakes identified by the clustering algorithm.
In [45]:
def cluster_centroid(eq_df, mask):
n = eq_df[ mask].shape[0]
sum_lon = eq_df[ mask].longitude.sum()
sum_lat = eq_df[mask].latitude.sum()
return (sum_lat/n, sum_lon/n )
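As a quick usage check (a sketch assuming the catalog's prefmag magnitude column is available), the center of mass of all events of magnitude 3.5 or larger can be obtained with a simple boolean mask; the cluster masks built below work the same way:
In [ ]:
# centroid (latitude, longitude) of the magnitude >= 3.5 events
lat_c, lon_c = cluster_centroid(eq_df, eq_df.prefmag >= 3.5)
print lat_c, lon_c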
The next function computes the maximum radius of a cluster: the largest great-circle distance (in km) between the cluster's center of mass and any event in the cluster.
In [46]:
def get_furthest_distance(eq_df, mask, centroid):
furthest_point = None
furthest_dist = None
for point0, point1 in zip(eq_df[mask].latitude,eq_df[mask].longitude):
point = (point0, point1)
dist = great_circle( centroid, point ).km
if (furthest_dist is None) or (dist > furthest_dist):
furthest_dist = dist
furthest_point = point
return furthest_dist
Here we pass the dataframe with the well locations and their injected volumes. We also provide the center of mass of a cluster and the maximum radius of the cluster, within which the function counts the number of wells from welldf and sums the total volume of water injected in this region.
In [47]:
def get_cluster_nwells_volume(welldf, centroid, radius):
n_wells = 0
volume = 0
for (i, coords) in enumerate(zip(welldf.latitude,welldf.longitude)):
if great_circle((coords[0], coords[1]), centroid).km < radius:
n_wells += 1
volume += welldf.loc[i, 'volume']
return [n_wells, volume]
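For example, with a hypothetical centroid near Oklahoma City and a 30 km search radius (welldf is assumed to carry a default integer index, e.g. after reset_index, since the function looks rows up with .loc):
In [ ]:
# hypothetical centroid (latitude, longitude) and a 30 km radius
n_wells, volume = get_cluster_nwells_volume(welldf, (35.47, -97.52), 30.)
print n_wells, volume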
Here we define a function that generates a mask for selecting a specific result of the clustering algorithm written in the dataframe.
In [48]:
def mask_cluster(df, period, eps, cluster_id):
''' INPUT:
df = dataframe for which we want to create a mask
period = 'post' or 'prior' for after or before 2010
eps = distance between events for the DBSCAN algorithm
cluster_id = id of the cluster for which we want to find the mask
example: we want cluster 1 from the catalog < 2010 found by DBSCAN for
eps = 8: mask_cluster(eq_df, 'prior', 8, 1)
'''
# reconstruct the column name
col_name = 'cluster_' + period + '_eps_' + str(eps)
mask_cluster = df[ col_name ] == cluster_id
return mask_cluster
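Putting these helpers together, here is a sketch that characterizes a single cluster: build its mask, locate its center of mass, measure its radius, and count the wells and injected volume inside that radius. Cluster 1 of the post-2010 catalog with eps = 8 is an arbitrary choice for illustration; these are the same steps used inside the regression function below.
In [ ]:
# cluster 1 of the post-2010 catalog found with eps = 8 km (arbitrary example)
mask = mask_cluster(eq_df, 'post', 8, 1)
# geographic center of mass of the cluster
centroid = cluster_centroid(eq_df, mask)
# radius = distance (km) from the centroid to the furthest event of the cluster
radius = get_furthest_distance(eq_df, mask, centroid)
# number of wells and total injected volume within that radius
# (welldf is assumed to have a default integer index, e.g. after reset_index)
n_wells, volume = get_cluster_nwells_volume(welldf, centroid, radius)
print centroid, radius, n_wells, volume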
In [49]:
def do_cluster_regression(eq_df, welldf, eps_s, lock ,cv = 5, standardization = None):
global best_grid_prior
global best_grid_post
# Filter by time
welldf_prior = welldf[welldf.year < 2010]
welldf_prior.reset_index(inplace=True)
welldf_post = welldf[welldf.year >= 2010]
welldf_post.reset_index(inplace=True)
for eps in eps_s:
X_prior = []
X_post = []
Y_prior = []
Y_post = []
total_prior = []
total_post = []
logging.debug('eps {} from batch {}, standardization method: {}'\
.format(eps, eps_s,standardization))
# DO THIS FOR PRIOR
# find the list of clusters
col_name = 'cluster_' + 'prior' + '_eps_' + str(eps)
clusters = list(set(eq_df[col_name].values) - set([-10]))
# this is for the clusters that are not noise
for cluster_id in clusters:
# get mask for the cluster_id
mask = mask_cluster(eq_df, 'prior', eps, cluster_id)
Y_prior_append = get_hours_between( eq_df[ mask] )
for y in Y_prior_append:
Y_prior.append(y)
# find the centroid of the cluster
centroid = cluster_centroid(eq_df, mask)
# find the largest radius = largest distance between centroid and points
# in the cluster
radius = get_furthest_distance(eq_df, mask, centroid)
# find the number of wells and volume within this radius
X_prior_append=get_cluster_nwells_volume(welldf_prior, centroid, radius)
total_prior.append(X_prior_append)
for i in range(len(Y_prior_append)):
X_prior.append(X_prior_append)
# add the interarrival for the events classified as noise
cluster_id = -1
# ------
mask = mask_cluster(eq_df, 'prior', eps, cluster_id)
Y_prior_append = get_hours_between( eq_df[ mask] )
for y in Y_prior_append:
Y_prior.append(y)
# wells and volume not assigned to any cluster (attributed to the noise events)
total_prior = np.array(total_prior)
X_prior_append=[welldf_prior.count().values[0] - sum(total_prior[:,0]) , welldf_prior.volume.sum() - sum(total_prior[:,1]) ]
for i in range(len(Y_prior_append)):
X_prior.append(X_prior_append)
#------
# DO THIS FOR POST
# find the list of clusters
col_name = 'cluster_' + 'post' + '_eps_' + str(eps)
clusters = list(set(eq_df[col_name].values) - set([-10]))
for cluster_id in clusters:
# get mask for the cluster_id
mask = mask_cluster(eq_df, 'post', eps, cluster_id)
Y_post_append = get_hours_between( eq_df[ mask] )
for y in Y_post_append:
Y_post.append(y)
# find the centroid of the cluster
centroid = cluster_centroid(eq_df, mask)
# find the largest radius = largest distance between centroid and points
# in the cluster
radius = get_furthest_distance(eq_df, mask, centroid)
# find the number of wells and volume within this radius
X_post_append=get_cluster_nwells_volume(welldf_post, centroid, radius)
total_post.append(X_post_append)
for i in range(len(Y_post_append)):
X_post.append(X_post_append)
# add the interarrival for the events classified as noise
cluster_id = -1
# ------
mask = mask_cluster(eq_df, 'post', eps, cluster_id)
Y_post_append = get_hours_between( eq_df[ mask] )
for y in Y_post_append:
Y_post.append(y)
# wells and volume not assigned to any cluster (attributed to the noise events)
total_post = np.array(total_post)
X_post_append=[welldf_post.count().values[0] - sum(total_post[:,0]) , welldf_post.volume.sum() - sum(total_post[:,1]) ]
for i in range(len(Y_post_append)):
X_post.append(X_post_append)
#------
X_prior = np.array(X_prior,dtype=np.float64)
X_post = np.array(X_post,dtype=np.float64)
# ------------------------------------------
# DOING THE REGRESSION
# ------------------------------------------
# logging.debug(' For {} cells, Total number of quakes: prior {}, post {}'\
# .format(len(locdict.keys()),sum(X_prior[:,0]), sum(X_post[:,0]) ))
reg_for = ['prior', 'post']
for reg in reg_for:
if reg == 'prior':
X = X_prior
Y = Y_prior
elif reg == 'post':
X = X_post
Y = Y_post
# --------------------
# SPLIT INTO TRAIN AND TEST
# --------------------
# Split in train - test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
# --------------------
# STANDARDIZATION OF THE DATA -- SCALING
# --------------------
if standardization == 'scaler':
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
elif standardization == 'MinMaxScaler':
min_max_scaler = preprocessing.MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
y_train = min_max_scaler.fit_transform(y_train)
y_test = min_max_scaler.transform(y_test)
else:
pass
# --------------------
# OPTIMIZE CLASSIFIER WITH RIDGE REGRESSION
# AND ORDINARY LEAST SQUARE REGRESSION
# --------------------
# Using Ridge Regression with built-in cross validation
# of the alpha parameters
# note that alpha = 0 corresponds to the Ordinary Least Square Regression
clf = linear_model.RidgeCV(alphas=[0.0, 0.1, 1, 10.0, 100.0, 1e3,1e4 ,1e5], cv =cv)
clf.fit(X_train, y_train)
logging.debug('{}: For eps = {}, score : {}'.format(reg,eps,clf.score(X_test, y_test)))
with lock:
if reg == 'prior':
best_grid_prior.append([clf,clf.score(X_test, y_test), eps])
elif reg == 'post':
best_grid_post.append([clf,clf.score(X_test, y_test),eps])
return
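Before launching the full threaded sweep below, it can be handy to exercise do_cluster_regression on a single eps value in the main thread; a minimal sketch, assuming the earlier imports and helpers (get_hours_between, logging, threading) are already loaded:
In [ ]:
# quick single-threaded check for one eps value (no standardization)
best_grid_prior = []
best_grid_post = []
do_cluster_regression(eq_df, welldf, [8], threading.Lock(), cv = 5, standardization = None)
# each entry is [fitted RidgeCV model, test score, eps]
print best_grid_prior[-1]
print best_grid_post[-1]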
In [ ]:
# ------------------------------------------
# CLUSTER INTERARRIVAL REGRESSION
# THIS CELL TAKES ABOUT 15 MIN TO RUN
# ------------------------------------------
best_grid_prior = []
best_grid_post = []
eps_batch = range(5,30)
# eps_batch = [5,7,9,11,20]
# randomly split the eps values into batches for the various threads
all_batch = split_in_batch(eps_batch)
# Vary the standardization and find optimum
for standardization in ['None','scaler','MinMaxScaler'] :
# parallelize the loop of interval
threads = []
lock = threading.Lock()
for thread_id in range(num_threads):
eps_s = all_batch[thread_id]
t = threading.Thread(target = do_cluster_regression, \
args = (eq_df, welldf, eps_s, lock ,5, standardization))
threads.append(t)
map(lambda t:t.start(), threads)
map(lambda t: t.join(), threads)
prior_dic, post_dic = print_best_score_append_dictionary('cluster')
regression_cluster_interarrival = {'prior': prior_dic, 'post': post_dic }
results_dic['regression_cluster_interarrival'] = regression_cluster_interarrival
print results_dic
Run the following cell to save the results from all the regressions into a pickled dictionary, results_dic.p. The dictionary is then converted into a dataframe, results_df, for plotting.
In [50]:
# Save a dictionary into a pickle file.
# import pickle
# pickle.dump( results_dic, open( "./tempdata/results_dic.p", "wb" ) )
# results = {}
# results['grid_regression_prior'] = results_dic['grid_regression']['prior']
# results['grid_regression_post'] = results_dic['grid_regression']['post']
# results['regression_cluster_interarrival_prior'] = results_dic['regression_cluster_interarrival']['prior']
# results['regression_cluster_interarrival_post'] = results_dic['regression_cluster_interarrival']['post']
# results['regression_grid_interarrival_prior'] = results_dic['regression_grid_interarrival']['prior']
# results['regression_grid_interarrival_post'] = results_dic['regression_grid_interarrival']['post']
# results_df = pd.DataFrame.from_dict(results, orient = 'index')
# results_df.to_csv('./tempdata/results_df.csv', sep = ',')
Run this cell if you want to load the results from this study saved into the dataframe results_df.
In [ ]:
results = pd.read_csv('tempdata/results_df.csv')
In [51]:
# Import results
results = pd.read_csv('tempdata/results_df.csv')
ord_res = results.sort('r2', ascending = False)
print results.shape
results.head(10)
Out[51]:
In [52]:
ord_res = results['r2'][[1,0,3,2,5,4]]
post = results['r2'][[0,2,4]]
prior = results['r2'][[1,3,5]]
fig, ax = plt.subplots()
b1 = plt.barh(range(6),ord_res, color = ["#99D699", "#B2B2B2","#99D699", "#B2B2B2","#99D699", "#B2B2B2"])
ax.set_title('R-squared by model, before and after 2010')
ax.set_xlim([0,.8])
ax.set_xlabel("R-squared")
ax.set_yticks([1-.1,3-.1,5-.1])
ax.set_yticklabels(['Grid Regression','Regression with Cluster',
'Grid Regression with Interarrival'])
ax.grid(False)
ax.legend(b1, ['Prior-2010','Post-2010'])
Out[52]:
The ridge regression algorithm proved effective at modeling the relationship between earthquake events and the number of wells together with the volume of injected water. Given the poor state of our data, especially the well data, in which many wells had missing years, we are satisfied with the performance of this model. The coefficient of the number of wells is negative: if we hold the total volume of injected water in an area constant, then the more wells there are, the lower the pressure at each individual well compared with an area that has the same injected volume but fewer wells. We conclude that the number of earthquakes is indeed associated with the volume of water injected by these wells. Surprisingly, fairly large grid cells (120.75 square km) yielded the optimal model. This project recommends the application of ridge regression in future studies of seismic events.