Python for Data Science

Twitter: @manjush3

A tutorial on Python for data science.

Contents

  • Background
  • Why use Python?
  • About this tutorial
  • Acquiring Data
  • Observing acquired data
    • Some conclusions
    • Joining data
  • Data Pre-Processing
    • Extracting useful features
    • Dealing with missing values
  • Data Visualization
  • Machine Learning Algorithms
    • Unsupervised
    • Supervised
  • References

Background

Python is a high-level programming language that lets you work quickly and integrate systems more effectively.

Designed By

  • Guido van Rossum

First released in 1991.

Used by millions of developers worldwide.

Stable release: v3.4.1 (2014-08-01)

Why Use Python?

  • Python is powerful... and fast;
  • plays well with others;
  • runs everywhere;
  • is friendly & easy to learn;
  • is Open.

About this tutorial

Data science, the science of pulling useful insights from data, is a powerful discipline that has gained a lot of popularity in recent years. In this tutorial we will explore some of the Python tools and algorithms that can help solve data problems.

Acquiring data

For this tutorial we will use publicly available data sets. For the initial illustrations, such as observing and joining data sets, we will use San Francisco restaurant data: the San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data lives on a website that has no API support, we can fall back on web scraping techniques; a minimal sketch is shown below. Since there are plenty of tutorials on how to get data, I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.
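
As a flavour of what scraping looks like, here is a minimal sketch using requests and BeautifulSoup; the URL is a placeholder for illustration only, not one of the data sources used below:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com/restaurant-scores")  # placeholder URL, not a real data source
soup = BeautifulSoup(page.text)
# collect the text of every table cell, row by row
rows = [[cell.get_text(strip=True) for cell in tr.find_all("td")]
        for tr in soup.find_all("tr")]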

Observing acquired data

In general, there are two kinds of data science problems: those that can only be solved with domain knowledge about the data sets, and those that any data scientist can tackle without prior domain knowledge. Let's look at the first few rows of each data set to get a sense of the kind of data we are dealing with.


In [1]:
import pandas as pd

SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")

SFbusiness_business.head()


Out[1]:
business_id name address city state postal_code latitude longitude phone_number
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN
1 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470
3 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco CA 94109 37.786848 -122.421547 NaN
4 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco CA 94104 37.792888 -122.403135 NaN

In [2]:
SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")

SFbusiness_inspections.head()


Out[2]:
business_id Score date type
0 10 98 20121114 routine
1 10 98 20120403 routine
2 10 100 20110928 routine
3 10 96 20110428 routine
4 10 100 20101210 routine

In [3]:
SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")

SFbusiness_ScoreLegend.head()


Out[3]:
Minimum_Score Maximum_Score Description
0 0 70 Poor
1 71 85 Needs Improvement
2 86 90 Adequate
3 91 100 Good

In [4]:
SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")

SFbusiness_violations.head()


Out[4]:
business_id date description
0 10 20121114 Unclean or degraded floors walls or ceilings ...
1 10 20120403 Unclean or degraded floors walls or ceilings ...
2 10 20110428 Inadequate and inaccessible handwashing facili...
3 12 20120420 Food safety certificate or food handler card n...
4 17 20120823 Inadequately cleaned or sanitized food contact...

In [5]:
SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")

SFfood_businesses_plus.head()


Out[5]:
business_id name address city state postal_code latitude longitude phone_no TaxCode business_certificate application_date owner_name owner_address owner_city owner_state owner_zip
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN H24 NaN NaN Tiramisu LLC 33 Belden St San Francisco CA 94104
1 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN H24 NaN 7/12/2002 0:00:00 KIKKA ITO, INC. 431 South Isis Ave. Inglewood CA 90301
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 (141) 555-5314 H24 NaN 4/5/1975 0:00:00 LIEUW, VICTOR & CHRISTINA C 648 MACARTHUR DRIVE DALY CITY CA 94015
3 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco CA 94109 37.786848 -122.421547 NaN H24 NaN NaN 24 Hour Fitness Inc 1200 Van Ness Ave, 3rd Floor San Francisco CA 94109
4 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco CA 94104 37.792888 -122.403135 NaN H24 NaN NaN OMNI San Francisco Hotel Corp 500 California St, 2nd Floor San Francisco CA 94104

In [6]:
SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")

SFfood_inspections_plus.head()



In [7]:
SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")

SFfood_violations_plus.head()



In [8]:
# A simple way to find out how many rows are present and which columns hold numerical data is describe()

SFfood_businesses_plus.describe()


Out[8]:
business_id latitude longitude business_certificate
count 6352.000000 5495.000000 5495.000000 1131.000000
mean 32944.535894 37.525775 -121.622553 449157.537577
std 28884.685537 3.047733 9.877572 159777.164993
min 10.000000 0.000000 -122.510896 4965.000000
25% 4138.500000 37.760272 -122.435457 446211.000000
50% 28534.500000 37.780568 -122.418129 465714.000000
75% 65468.500000 37.789875 -122.405568 471461.000000
max 74591.000000 37.875937 0.000000 4222215.000000

In [9]:
SFfood_businesses_plus.count() #NaN values are ignored


Out[9]:
business_id             6352
name                    6352
address                 6350
city                    6352
state                   6352
postal_code             6121
latitude                5495
longitude               5495
phone_no                1461
TaxCode                 6352
business_certificate    1131
application_date        4481
owner_name              6342
owner_address           6331
owner_city              6263
owner_state             6262
owner_zip               6244
dtype: int64

Some conclusions

  • Some of the data sheets are quite similar to one another.
  • NaN signifies a null (missing) value.
  • Looking into these data sets, only some of the features are useful; the rest should be filtered out.
  • We need to know which columns each data sheet contains before we try to join the sheets.
  • The data fields that matter are business_id, name, address, latitude, longitude, Score and date, which are present in the businesses and inspections data sheets. The remaining fields are either repeated or not required for our data problems (see the sketch after this list).
  • Almost every data set has business_id as a primary key, so we can use it to join the data sheets.
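
For example, trimming the businesses sheet down to just the fields we care about is a single column selection (a minimal sketch using the SFbusiness_business frame loaded in In [1]; Score and date live in the inspections sheet and come in with the join below):

useful_cols = ['business_id', 'name', 'address', 'latitude', 'longitude']
SFbusiness_business[useful_cols].head()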

Joining data


In [10]:
'''pandas merge() defaults to an inner join, so only businesses with at least one matching row
   in SFbusiness_inspections appear in the result; pass how='left' to keep every record of
   SFbusiness_business regardless (a sketch follows this cell) '''

print SFbusiness_business.columns

print SFbusiness_inspections.columns

main_table = SFbusiness_business.merge( SFbusiness_inspections, on='business_id' )

print main_table.columns


Index([u'business_id', u'name', u'address', u'city', u'state', u'postal_code', u'latitude', u'longitude', u'phone_number'], dtype='object')
Index([u'business_id', u'Score', u'date', u'type'], dtype='object')
Index([u'business_id', u'name', u'address', u'city', u'state', u'postal_code', u'latitude', u'longitude', u'phone_number', u'Score', u'date', u'type'], dtype='object')
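
If we do want every record of SFbusiness_business kept even when it has no matching inspection, the join type has to be requested explicitly, since merge() defaults to an inner join. A minimal sketch:

main_table_left = SFbusiness_business.merge(SFbusiness_inspections, on='business_id', how='left')
print main_table_left.shape  # at least as many rows as the inner join above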

In [11]:
# let's look at the first few rows of our main_table

main_table.head(10)


Out[11]:
business_id name address city state postal_code latitude longitude phone_number Score date type
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20121114 routine
1 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20120403 routine
2 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 100 20110928 routine
3 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 96 20110428 routine
4 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 100 20101210 routine
5 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN 100 20121120 routine
6 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN 98 20120420 routine
7 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN 100 20111018 routine
8 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN 100 20110401 routine
9 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470 100 20120823 routine

Data Pre-Processing

Data often arrives in a form that is difficult to use, so pre-processing is essential for improving accuracy. For the pre-processing examples we use biostatistics data from Vanderbilt University. You can find the data set here; for convenience I included it in the git repository.

Extracting useful features

Let's assume we want to know how a patient's death depends on age, sex, race and income, and that we are not interested in the remaining features of the data set. We will therefore build a pandas data frame that serves exactly that purpose.


In [12]:
data1 = pd.read_csv("data/support2.csv")
# Create a pandas data frame that holds only the features we care about: age, sex, race, income and death (dead=1 | alive=0)
med = pd.DataFrame( {'age':data1['age'],
                   'death':data1['death'],
                    'sex':data1['sex'],
                    'race': data1['race'],
                    'income': data1['income'],
                     })
med.head(10)


Out[12]:
age death income race sex
0 62.84998 0 $11-$25k other male
1 60.33899 1 $11-$25k white female
2 52.74698 1 under $11k white female
3 42.38498 1 under $11k white female
4 79.88495 0 NaN white female
5 93.01599 1 NaN white male
6 62.37097 1 $25-$50k white male
7 86.83899 1 NaN white male
8 85.65594 1 NaN black male
9 42.25897 1 $25-$50k hispanic female

Dealing with missing values

The most common pre-processing step is dealing with missing values. Pandas automatically reads null values as NaN. We can drop those values or replace them with '0', but filling nulls with an appropriate measure of central tendency (median, mean or mode) is considered better practice. The Series data structure is useful for this; a Series is a one-dimensional array-like object.


In [13]:
from pandas import Series
seriesresult = Series(x for x in med['income'])
# replace each income bracket with a representative midpoint value (in thousands of dollars)
seriesresult=seriesresult.replace(to_replace='$11-$25k', value='18')
seriesresult=seriesresult.replace(to_replace='under $11k', value='5.5')
seriesresult=seriesresult.replace(to_replace='$25-$50k', value='37.5')
seriesresult=seriesresult.replace(to_replace='>$50k', value='75')
print seriesresult


0       18
1       18
2      5.5
3      5.5
4      NaN
5      NaN
6     37.5
7      NaN
8      NaN
9     37.5
10     NaN
11     NaN
12      18
13      18
14     NaN
...
9090      18
9091     5.5
9092     NaN
9093    37.5
9094     NaN
9095     NaN
9096      18
9097      18
9098     NaN
9099     5.5
9100     NaN
9101     NaN
9102     NaN
9103     NaN
9104      18
Length: 9105, dtype: object
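
The four replace calls above can also be collapsed into one call that takes a mapping from bracket label to midpoint (a sketch; the string values are the same ones used above):

income_map = {'under $11k': '5.5', '$11-$25k': '18', '$25-$50k': '37.5', '>$50k': '75'}
seriesresult = Series(med['income']).replace(income_map)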

In [14]:
# Checking for null values
print "\nCSV Value isnull: " + str(seriesresult.isnull())
# Ignoring null values
print "\nCSV Value dropna: " + str(seriesresult.dropna())
# Replacing with '0'
print "\nCSV Value fillna(0): " + str(seriesresult.fillna(0))


CSV Value isnull: 0     False
1     False
2     False
3     False
4      True
5      True
6     False
7      True
8      True
9     False
10     True
11     True
12    False
13    False
14     True
...
9090    False
9091    False
9092     True
9093    False
9094     True
9095     True
9096    False
9097    False
9098     True
9099    False
9100     True
9101     True
9102     True
9103     True
9104    False
Length: 9105, dtype: bool

CSV Value dropna: 0       18
1       18
2      5.5
3      5.5
6     37.5
9     37.5
12      18
13      18
15      75
17    37.5
18     5.5
19      18
20      75
22    37.5
23    37.5
...
9079    37.5
9080     5.5
9081    37.5
9083     5.5
9085      75
9086      18
9087      18
9088      18
9090      18
9091     5.5
9093    37.5
9096      18
9097      18
9099     5.5
9104      18
Length: 6123, dtype: object

CSV Value fillna(0): 0       18
1       18
2      5.5
3      5.5
4        0
5        0
6     37.5
7        0
8        0
9     37.5
10       0
11       0
12      18
13      18
14       0
...
9090      18
9091     5.5
9092       0
9093    37.5
9094       0
9095       0
9096      18
9097      18
9098       0
9099     5.5
9100       0
9101       0
9102       0
9103       0
9104      18
Length: 9105, dtype: object

Since we are dealing with ordinal data, we can fill the missing values with the median.


In [15]:
l = str(seriesresult.median())
print "\nmedian: " + l
k = float(l)
print k
# replacing missing values with the median
print "\nCSV Value fillna(median): " + str(seriesresult.fillna(k))


median: 18.0
18.0

CSV Value fillna(median): 0       18
1       18
2      5.5
3      5.5
4       18
5       18
6     37.5
7       18
8       18
9     37.5
10      18
11      18
12      18
13      18
14      18
...
9090      18
9091     5.5
9092      18
9093    37.5
9094      18
9095      18
9096      18
9097      18
9098      18
9099     5.5
9100      18
9101      18
9102      18
9103      18
9104      18
Length: 9105, dtype: object

There are better ways to replace missing values; one is linear regression, where we fit a linear model to predict the missing field from another one. There is a column called charges containing medical bills. Let's see whether charges and income show any trend together.


In [16]:
ourfocus = pd.DataFrame({'income':data1['income'],
                         'charges':data1['charges']})
ourfocus['income']=seriesresult # putting result of seriesresult in place of ourfocus income column
ourfocus.head(10)


Out[16]:
charges income
0 9715 18
1 34496 18
2 41094 5.5
3 3075 5.5
4 50127 NaN
5 6884 NaN
6 30460 37.5
7 30460 NaN
8 NaN NaN
9 9914 37.5

We should drop all the missing values here, since we are trying to measure the correlation between charges and income.


In [17]:
import numpy as np
ourfocus = ourfocus.dropna().reset_index()
new = pd.DataFrame({'charges':ourfocus['charges'],
                    'income':ourfocus['income']})
#converting all the values of the data frame into floats
new=new.applymap(lambda x:float(x))
#print ourfocus['charges'].mean
#print ourfocus['income'].mean
print new.head(10)
new.corr()


   charges  income
0     9715    18.0
1    34496    18.0
2    41094     5.5
3     3075     5.5
4    30460    37.5
5     9914    37.5
6     4353    18.0
7    19783    18.0
8    10758    75.0
9   283303    37.5
Out[17]:
charges income
charges 1.0000 0.1237
income 0.1237 1.0000

A correlation of 0.1237 means only a very slight linear relationship exists between income and charges, so we now know that we cannot use charges to fill the missing values of income.
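
For completeness, if a strongly correlated column had been available, a regression-based fill could look like the sketch below. This is hypothetical: it reuses data1 and the numeric seriesresult from above together with scikit-learn's LinearRegression, and with a correlation of only 0.12 it is not worth doing here.

from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'charges': data1['charges'],
                   'income': seriesresult.astype(float)})
known = df.dropna()                                          # rows where both values are present
missing = df[df['income'].isnull() & df['charges'].notnull()]

reg = LinearRegression().fit(known[['charges']], known['income'])
df.loc[missing.index, 'income'] = reg.predict(missing[['charges']])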

Data Visualization

I am using Bokeh charts for the visualizations. You can find more about Bokeh here.


In [18]:
#Scatter Plot
from collections import OrderedDict
from bokeh.charts import Scatter

data2 = data1.head(200) # copy the first 200 rows into a separate data frame

data2['d.time'] = data2['d.time'].map(lambda x:x/365.0 ) # convert days into years by dividing every value by 365

male = data2[(data2.sex == "male")][["age", "d.time"]]  

female = data2[(data2.sex == "female")][["age", "d.time"]] 

xyvalues = OrderedDict([("male", male.values), ("female", female.values)]) # using OrderedDict 

scatter = Scatter(xyvalues, filename = "plots/scatter.html") 
#scatter.notebook().show()
#output_notebook
#plot = scatter
scatter.title("Scatter Plot").xlabel("Age in years").ylabel("Years spent on hospitals").legend("top_left").width(600).height(400).show()
from IPython.display import HTML
HTML('<iframe src=plots/scatter.html width=700 height=500></iframe>')


Wrote plots/scatter.html
Out[18]:

In [19]:
# Bar Graph
import pandas as pd

# let's compare, for each race, how many deaths occurred in hospital versus outside it, using a stacked bar chart
data2 = pd.DataFrame({'race': data1['race'],'normaldeath': data1['death'] ,'hospdead': data1['hospdead']})
dead = data2[data2['normaldeath']==1].groupby('race').count()
hospdead = data2[data2['hospdead']==1].groupby('race').count()
dead['normaldeath'] = dead['normaldeath'] - hospdead['hospdead']
dead['hospdead'] = hospdead['hospdead']
print dead
from bokeh.charts import Bar
bar = Bar(dead, filename="plots/bar1.html")
bar.title("Stacked Bar Graph").xlabel("Race").ylabel("Total number of people dead") .legend("top_left").width(600).height(700).stacked().show()
from IPython.display import HTML
HTML('<iframe src=plots/bar1.html width=700 height=800></iframe>')


          hospdead  normaldeath
race                           
asian           30           28
black          383          526
hispanic        68          105
other           37           44
white         1823         3125
Wrote plots/bar1.html
Out[19]:

Machine Learning Algorithms

Unsupervised Learning

K-means clustering is easy to apply, yet very powerful in terms of the output it produces. We start by generating some artificial data.

In [20]:
# kmeans Clustering
import matplotlib.pyplot as plt
%matplotlib inline
plt.jet() # set the color map. When your colors are lost, re-run this.
import sklearn.datasets as datasets
X, Y = datasets.make_blobs(centers=6, cluster_std=0.5, random_state=0) # random blobs drawn around 6 centers with a standard deviation of 0.5


<matplotlib.figure.Figure at 0x581bb90>

In [21]:
plt.scatter(X[:,0], X[:,1]);
plt.show()



In [22]:
from sklearn.cluster import KMeans
kmeans = KMeans(3, random_state=8)
Y_hat = kmeans.fit(X).labels_

In [23]:
plt.scatter(X[:,0], X[:,1], c=Y_hat);
plt.show()



In [24]:
plt.scatter(X[:,0], X[:,1], c=Y_hat, alpha=0.4)
mu = kmeans.cluster_centers_
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat))
plt.show()
print mu


[[-1.23211442  8.04092475]
 [ 7.53975776 -0.94980578]
 [ 0.47403713  2.77387221]]

In [25]:
data3 = data1.head(200)
#print data3
plt.scatter(data3['age'], data3['d.time']);
plt.show()
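
The same fit can be applied to the patient scatter above (a sketch; the choice of two clusters here is purely for illustration):

patient_X = data3[['age', 'd.time']].dropna().values
patient_labels = KMeans(2, random_state=8).fit(patient_X).labels_
plt.scatter(patient_X[:, 0], patient_X[:, 1], c=patient_labels)
plt.show()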



In [26]:
# PCA demonstration on the iris data set
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pca = PCA(n_components=2, whiten=True).fit(iris.data)
X_pca = pca.transform(iris.data)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
var_explained = pca.explained_variance_ratio_ * 100
plt.xlabel('First Component: {0:.1f}%'.format(var_explained[0]))
plt.ylabel('Second Component: {0:.1f}%'.format(var_explained[1]))


Out[26]:
<matplotlib.text.Text at 0x6580310>

Applying PCA is not automatically an improvement: because it discards part of the variance in the data, a downstream model can lose accuracy just as easily as it gains it.
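
A quick way to judge how much information PCA is discarding is to look at the cumulative explained variance before fixing n_components (a short sketch reusing the iris data loaded above):

pca_full = PCA(whiten=True).fit(iris.data)          # keep every component
print pca_full.explained_variance_ratio_.cumsum()   # choose n_components where this approaches 1.0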

Supervised Learning - Regression


In [27]:
# Linear Regression
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
houses = datasets.load_boston()
houses_X = houses.data[:, np.newaxis]
houses_X_temp = houses_X[:, :, 2]
X_train, X_test, Y_train, Y_test = train_test_split(houses_X_temp, houses.target, test_size=0.45)
lreg = linear_model.LinearRegression()
lreg.fit(X_train, Y_train)
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, lreg.predict(X_test), color='red', linewidth=3)
plt.show()
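
Rather than judging the fit only from the plot, we can report the R^2 of the fitted line on the held-out test split (a minimal sketch):

print "R^2 on the test split: %.3f" % lreg.score(X_test, Y_test)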



In [28]:
# Decision boundary of a logistic regression classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# Let's write a plotting helper for convenience so that we can reuse it.
def plot_estimator(estimator, X, Y):
 estimator.fit(X, Y)
 # Plot the decision boundary. For that, we will assign a color to each
 # point in the mesh [x_min, x_max] x [y_min, y_max].
 x_min, x_max = X[:, 0].min() -0.5, X[:, 0].max()+0.5
 y_min, y_max = X[:, 1].min()-0.5 , X[:, 1].max()+0.5
 xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),np.linspace(y_min, y_max, 100))
 Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
 # Put the result into a color map
 Z = Z.reshape(xx.shape)
 plt.figure()
 plt.xlabel('Sepal length')
 plt.ylabel('Sepal width')
 plt.xlim(xx.min(), xx.max())
 plt.ylim(yy.min(), yy.max())
 plt.xticks(())
 plt.yticks(())
 plt.pcolormesh(xx, yy, Z, alpha=0.2,cmap='rainbow')
 plt.scatter(X[:, 0], X[:, 1], c=Y, s=20 )


# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

logreg = linear_model.LogisticRegression(C=1e5)

# fit the logistic regression classifier to the data
logreg.fit(X, Y)

plot_estimator(logreg,X,Y)


Supervised Learning - Classification


In [29]:
from sklearn.datasets.samples_generator import make_blobs
X, Y = make_blobs(n_samples=200, centers=2,
                  random_state=0, cluster_std=0.60)

plt.scatter(X[:, 0], X[:, 1], c=Y, s=20);



In [30]:
from sklearn.svm import SVC # Support Vector Classifier
clf = SVC(kernel='linear')
clf.fit(X, Y)


Out[30]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [31]:
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
Y, X = np.meshgrid(y, x)
P = np.zeros_like(X)
for i, xi in enumerate(x):
 for j, yj in enumerate(y):
  P[i, j] = clf.decision_function([xi, yj])
plt.contour(X, Y, P, colors='k',levels=[-1, 0, 1],linestyles=['--', '-', '--'])


Out[31]:
<matplotlib.contour.QuadContourSet instance at 0x675c248>
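
The few training points that actually pin down this margin are stored on the fitted classifier as clf.support_vectors_; they can be inspected or highlighted directly (a sketch):

print clf.support_vectors_  # the samples that define the margin
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='k')
plt.show()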

In [32]:
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
#X, Y = make_blobs(n_samples=500, centers=3,random_state=0, cluster_std=0.60)
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)


Out[32]:
<matplotlib.collections.PathCollection at 0x688f390>

In [33]:
clf = DecisionTreeClassifier(max_depth=10)
plot_estimator(clf, X, Y) # function call to plot_estimator


Decision trees tend to overfit the data, and most models of this kind face the same problem. A better approach is to use an ensemble of decision trees called a random forest.


In [34]:
# Random forests
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
plot_estimator(clf, X, Y) # function call to plot estimator
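
To compare the single tree and the forest with a number rather than a picture, cross-validation on the same two iris features works (a sketch using the era-appropriate sklearn.cross_validation module imported earlier):

from sklearn.cross_validation import cross_val_score
print "decision tree: %.3f" % cross_val_score(DecisionTreeClassifier(max_depth=10), X, Y, cv=5).mean()
print "random forest: %.3f" % cross_val_score(RandomForestClassifier(n_estimators=10, random_state=0), X, Y, cv=5).mean()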


References