For this tutorial we will use publicly available data sets. For the initial illustrations, such as inspecting and joining data sets, we will use San Francisco restaurant data: the San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data is only available on a website without API support, we can use web scraping techniques. Since there are plenty of tutorials on how to get data, I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.
In general, there are two kinds of data science problems: the first kind can only be solved with domain knowledge about the data sets, while the second kind can be solved by any data scientist without prior domain knowledge. Let's look at the first few rows of each data set to get a sense of what kind of data we are dealing with.
In [1]:
import pandas as pd
SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")
SFbusiness_business.head()
Out[1]:
In [2]:
SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")
SFbusiness_inspections.head()
Out[2]:
In [3]:
SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")
SFbusiness_ScoreLegend.head()
Out[3]:
In [4]:
SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")
SFbusiness_violations.head()
Out[4]:
In [5]:
SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")
SFfood_businesses_plus.head()
Out[5]:
In [6]:
SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")
SFfood_inspections_plus.head()
Out[6]:
In [7]:
SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")
SFfood_violations_plus.head()
Out[7]:
In [8]:
# A simple way to find out how many rows are present and which columns contain numerical data is to use describe()
SFfood_businesses_plus.describe()
Out[8]:
In [9]:
SFfood_businesses_plus.count() #NaN values are ignored
Out[9]:
In [10]:
'''pandas merge() defaults to an inner join; passing how='left' keeps every record of
SFbusiness_business even when no corresponding row exists in SFbusiness_inspections'''
print SFbusiness_business.columns
print SFbusiness_inspections.columns
main_table = SFbusiness_business.merge(SFbusiness_inspections, on='business_id', how='left')
print main_table.columns
In [11]:
# let's look at a few rows of our main_table
main_table.head(10)
Out[11]:
Data often comes in a form that is difficult to use directly, so pre-processing is essential for accuracy. We will use a biostatistics data set from Vanderbilt University to demonstrate pre-processing. You can find the data set here; for convenience, I included it in the git repository.
In [12]:
data1 = pd.read_csv("data/support2.csv")
# Creating a pandas DataFrame that holds only a few features: age, sex, race, income and death (dead=1 | alive=0)
med = pd.DataFrame( {'age':data1['age'],
'death':data1['death'],
'sex':data1['sex'],
'race': data1['race'],
'income': data1['income'],
})
med.head(10)
Out[12]:
The most common pre-processing step is dealing with missing values. Pandas automatically represents null values as NaN. We can drop those values or replace them with '0', but filling them with an appropriate measure of central tendency such as the median, mean or mode is considered better practice. For this purpose, the Series data structure is useful: a Series is a one-dimensional array-like object.
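As a quick illustration on a toy Series (made-up values, not the tutorial data), the different fills behave as follows:
from pandas import Series
import numpy as np
toy = Series([1.0, np.nan, 3.0, np.nan, 5.0])
print toy.fillna(0)            # replace missing values with 0
print toy.fillna(toy.mean())   # replace with the mean
print toy.fillna(toy.median()) # replace with the median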
In [13]:
from pandas import Series
seriesresult = Series(med['income'])
# replacing 'under $11k' with 5.5 (approximate bracket midpoint, in thousands of dollars)
seriesresult = seriesresult.replace(to_replace='under $11k', value='5.5')
# replacing '$11-$25k' with 18
seriesresult = seriesresult.replace(to_replace='$11-$25k', value='18')
# replacing '$25-$50k' with 37.5
seriesresult = seriesresult.replace(to_replace='$25-$50k', value='37.5')
# replacing '>$50k' with 75
seriesresult = seriesresult.replace(to_replace='>$50k', value='75')
print seriesresult
In [14]:
# Checking for null values
print "\nCSV Value isnull: " + str(seriesresult.isnull())
# Ignoring null values
print "\nCSV Value dropna: " + str(seriesresult.dropna())
# Replacing with '0'
print "\nCSV Value fillna(0): " + str(seriesresult.fillna(0))
Since, we are dealing with ordinal data, we could replace it with median.
In [15]:
median_income = seriesresult.astype(float).median()
print "\nmedian: " + str(median_income)
# replacing missing values with the median
print "\nCSV Value fillna(median): " + str(seriesresult.astype(float).fillna(median_income))
There are better ways to replace missing values; one of them is to use linear regression and predict the missing values from another feature. There is a column called charges that records medical bills. Let's see whether charges and income show any trend together.
In [16]:
ourfocus = pd.DataFrame({'income':data1['income'],
'charges':data1['charges']})
ourfocus['income'] = seriesresult # replace the income column with the cleaned bracket-midpoint values
ourfocus.head(10)
Out[16]:
We should remove all the missing values here, since we are measuring the correlation between charges and income.
In [17]:
import numpy as np
ourfocus = ourfocus.dropna().reset_index()
new = pd.DataFrame({'charges':ourfocus['charges'],
'income':ourfocus['income']})
# converting all values of the data frame into floats
new=new.applymap(lambda x:float(x))
#print ourfocus['charges'].mean
#print ourfocus['income'].mean
print new.head(10)
new.corr()
Out[17]:
A correlation of 0.1237 means only a very slight correlation exists between income and charges. So we now know that we cannot use charges to fill in the missing values of income.
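For completeness, here is a minimal sketch of the regression-based imputation idea mentioned earlier. In this case it is purely hypothetical, since the weak correlation makes charges a poor predictor of income; the frame df below is built only for the illustration.
# sketch: impute missing income from charges with a linear fit (hypothetical here)
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame({'charges': data1['charges'], 'income': seriesresult}).astype(float)
known = df.dropna()                                        # rows where both values are present
to_fill = df['income'].isnull() & df['charges'].notnull()  # rows we could fill
reg = linear_model.LinearRegression()
reg.fit(known[['charges']], known['income'])
df.loc[to_fill, 'income'] = reg.predict(df.loc[to_fill, ['charges']])
print "filled %d missing income values" % to_fill.sum()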
I am using Bokeh charts for the visualizations. You can find more about it here.
In [18]:
#Scatter Plot
from collections import OrderedDict
from bokeh.charts import Scatter
data2 = data1.head(200).copy() # copying the first 200 rows into a separate data frame
data2['d.time'] = data2['d.time'].map(lambda x: x/365.0) # converting days into years by dividing by 365
male = data2[(data2.sex == "male")][["age", "d.time"]]
female = data2[(data2.sex == "female")][["age", "d.time"]]
xyvalues = OrderedDict([("male", male.values), ("female", female.values)]) # using OrderedDict
scatter = Scatter(xyvalues, filename = "plots/scatter.html")
scatter.title("Scatter Plot").xlabel("Age in years").ylabel("Years spent on hospitals").legend("top_left").width(600).height(400).show()
from IPython.display import HTML
HTML('<iframe src=plots/scatter.html width=700 height=500></iframe>')
Out[18]:
In [19]:
# Bar Graph
import pandas as pd
# for each race, let's compare how many people died in hospital versus outside the hospital, in a bar chart
data2 = pd.DataFrame({'race': data1['race'],'normaldeath': data1['death'] ,'hospdead': data1['hospdead']})
dead = data2[data2['normaldeath']==1].groupby('race').count()
hospdead = data2[data2['hospdead']==1].groupby('race').count()
dead['normaldeath'] = dead['normaldeath'] - hospdead['hospdead']
dead['hospdead'] = hospdead['hospdead']
print dead
from bokeh.charts import Bar
bar = Bar(dead, filename="plots/bar1.html")
bar.title("Stacked Bar Graph").xlabel("Race").ylabel("Total number of people dead") .legend("top_left").width(600).height(700).stacked().show()
from IPython.display import HTML
HTML('<iframe src=plots/bar1.html width=700 height=800></iframe>')
Out[19]:
In [20]:
# kmeans Clustering
import matplotlib.pyplot as plt
%matplotlib inline
plt.jet() # set the color map. When your colors are lost, re-run this.
import sklearn.datasets as datasets
X, Y = datasets.make_blobs(centers=6, cluster_std=0.5, random_state=0) # random blobs with 6 centers and a standard deviation of 0.5
In [21]:
plt.scatter(X[:,0], X[:,1]);
plt.show()
In [22]:
from sklearn.cluster import KMeans
kmeans = KMeans(3, random_state=8)
Y_hat = kmeans.fit(X).labels_
In [23]:
plt.scatter(X[:,0], X[:,1], c=Y_hat);
plt.show()
In [24]:
plt.scatter(X[:,0], X[:,1], c=Y_hat, alpha=0.4)
mu = kmeans.cluster_centers_
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat))
plt.show()
print mu
In [25]:
data3 = data1.head(200)
#print data3
plt.scatter(data3['age'], data3['d.time']);
plt.show()
In [26]:
# PCA demonstration on the iris data set
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pca = PCA(n_components=2, whiten=True).fit(iris.data)
X_pca = pca.transform(iris.data)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
var_explained = pca.explained_variance_ratio_ * 100
plt.xlabel('First Component: {0:.1f}%'.format(var_explained[0]))
plt.ylabel('Second Component: {0:.1f}%'.format(var_explained[1]))
Out[26]:
Applying PCA is not automatically a good thing: because it throws away variance, you can easily lose accuracy rather than gain it. A quick sanity check is to look at how much variance the retained components actually explain.
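Here is a minimal sketch of that check on the same iris data; the 95% threshold is just an illustrative choice, not a rule.
# inspect the cumulative explained variance of the principal components
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
pca_full = PCA().fit(iris.data)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, frac in enumerate(cumulative, 1):
    print "%d component(s) explain %.1f%% of the variance" % (k, frac * 100)
n_needed = np.argmax(cumulative >= 0.95) + 1  # smallest k reaching the (illustrative) 95% threshold
print "components needed for 95%%: %d" % n_needed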
In [27]:
# Linear Regression
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
houses = datasets.load_boston()
houses_X_temp = houses.data[:, np.newaxis, 2] # use a single feature as the predictor
X_train, X_test, Y_train, Y_test = train_test_split(houses_X_temp, houses.target, test_size=0.45)
lreg = linear_model.LinearRegression()
lreg.fit(X_train, Y_train)
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, lreg.predict(X_test), color='red', linewidth=3)
plt.show()
In [28]:
# Plotting the decision boundary of a logistic regression classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
# Let's write a plotting helper for convenience so that we can reuse it.
def plot_estimator(estimator, X, Y):
    estimator.fit(X, Y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of a logistic regression classifier and fit the data.
logreg.fit(X, Y)
plot_estimator(logreg,X,Y)
In [29]:
from sklearn.datasets.samples_generator import make_blobs
X, Y = make_blobs(n_samples=200, centers=2,
random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20);
In [30]:
from sklearn.svm import SVC # Support Vector Classifier
clf = SVC(kernel='linear')
clf.fit(X, Y)
Out[30]:
In [31]:
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
Y, X = np.meshgrid(y, x)
P = np.zeros_like(X)
for i, xi in enumerate(x):
    for j, yj in enumerate(y):
        P[i, j] = clf.decision_function([[xi, yj]])
plt.contour(X, Y, P, colors='k',levels=[-1, 0, 1],linestyles=['--', '-', '--'])
Out[31]:
In [32]:
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
#X, Y = make_blobs(n_samples=500, centers=3,random_state=0, cluster_std=0.60)
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
Out[32]:
In [33]:
clf = DecisionTreeClassifier(max_depth=10)
plot_estimator(clf, X, Y) # function call to plot_estimator
Decision trees tend to overfit the data, and many other models face the same problem. A better approach is to use an ensemble of decision trees called a random forest (a quick cross-validated comparison of the two follows the random forest cell below).
In [34]:
# Random forests
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
plot_estimator(clf, X, Y) # function call to plot estimator
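To back up the overfitting point above, here is a minimal sketch comparing cross-validated accuracy of a single deep tree against the random forest on the same two iris features, using the older sklearn.cross_validation module consistent with the rest of this notebook. The forest will typically score at least as well as the single tree.
# compare cross-validated accuracy: single deep tree vs. random forest
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
tree_scores = cross_val_score(DecisionTreeClassifier(max_depth=10), X, Y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=10, random_state=0), X, Y, cv=5)
print "decision tree mean accuracy: %.3f" % tree_scores.mean()
print "random forest mean accuracy: %.3f" % forest_scores.mean()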