For this tutorial we will use publicly available data sets. For the initial illustrations, such as inspecting and joining data sets, we will use San Francisco restaurant data: the San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data is only available on a website without API support, we can use web scraping techniques. Since there are plenty of tutorials on how to get data, I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.
In general, there are two kinds of data science problems: the first kind can only be solved with domain knowledge about the data sets, while the second kind can be solved by any data scientist without prior domain knowledge. Let's look at the first few rows of each data set to get a sense of what kind of data we are dealing with.
In [1]:
import pandas as pd
SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")
SFbusiness_business.head()
Out[1]:
In [2]:
SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")
SFbusiness_inspections.head()
Out[2]:
In [3]:
SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")
SFbusiness_ScoreLegend.head()
Out[3]:
In [4]:
SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")
SFbusiness_violations.head()
Out[4]:
In [5]:
SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")
SFfood_businesses_plus.head()
Out[5]:
In [6]:
SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")
SFfood_inspections_plus.head()
Out[6]:
In [7]:
SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")
SFfood_violations_plus.head()
Out[7]:
In [8]:
# A simple way to find out how many rows are present and which columns contain numerical data is to use describe()
SFfood_businesses_plus.describe()
Out[8]:
In [9]:
SFfood_businesses_plus.count() #NaN values are ignored
Out[9]:
In [10]:
'''pandas merge() defaults to an inner join; passing how='left' keeps every record of
SFbusiness_business even when no corresponding row exists in SFbusiness_inspections'''
print SFbusiness_business.columns
print SFbusiness_inspections.columns
main_table = SFbusiness_business.merge(SFbusiness_inspections, on='business_id', how='left')
print main_table.columns
In [11]:
# let's look at a few rows of our main_table
main_table.head(10)
Out[11]:
Data often comes in a form that is difficult to use directly, so pre-processing is essential for accuracy. We will use a biostatistics data set from Vanderbilt University to demonstrate pre-processing. You can find the data set here; for convenience, I included it in the git repository.
In [12]:
data1 = pd.read_csv("data/support2.csv")
# Creating a pandas DataFrame that holds only a few features: age, sex, race, income and death (dead=1 | alive=0)
med = pd.DataFrame( {'age':data1['age'],
'death':data1['death'],
'sex':data1['sex'],
'race': data1['race'],
'income': data1['income'],
})
med.head(10)
Out[12]:
The most common pre-processing step is dealing with missing values. Pandas automatically represents null values as NaN. We can drop those values or replace them with '0', but filling them with an appropriate measure of central tendency such as the median, mean or mode is considered better practice. For this purpose, the Series data structure is useful: a Series is a one-dimensional array-like object.
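As a quick illustration on a toy Series (made-up values, not the tutorial data), the different fills behave as follows:
from pandas import Series
import numpy as np
toy = Series([1.0, np.nan, 3.0, np.nan, 5.0])
print toy.fillna(0)            # replace missing values with 0
print toy.fillna(toy.mean())   # replace with the mean
print toy.fillna(toy.median()) # replace with the median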
In [13]:
from pandas import Series
seriesresult = Series(med['income'])
# replacing 'under $11k' with 5.5 (approximate bracket midpoint, in thousands of dollars)
seriesresult = seriesresult.replace(to_replace='under $11k', value='5.5')
# replacing '$11-$25k' with 18
seriesresult = seriesresult.replace(to_replace='$11-$25k', value='18')
# replacing '$25-$50k' with 37.5
seriesresult = seriesresult.replace(to_replace='$25-$50k', value='37.5')
# replacing '>$50k' with 75
seriesresult = seriesresult.replace(to_replace='>$50k', value='75')
print seriesresult
In [14]:
# Checking for null values
print "\nCSV Value isnull: " + str(seriesresult.isnull())
# Ignoring null values
print "\nCSV Value dropna: " + str(seriesresult.dropna())
# Replacing with '0'
print "\nCSV Value fillna(0): " + str(seriesresult.fillna(0))
Since, we are dealing with ordinal data, we could replace it with median.
In [15]:
median_income = seriesresult.astype(float).median()
print "\nmedian: " + str(median_income)
# replacing missing values with the median
print "\nCSV Value fillna(median): " + str(seriesresult.astype(float).fillna(median_income))
There are better ways to replace missing values; one of them is to use linear regression and predict the missing values from another feature. There is a column called charges that records medical bills. Let's see whether charges and income show any trend together.
In [16]:
ourfocus = pd.DataFrame({'income':data1['income'],
'charges':data1['charges']})
ourfocus['income'] = seriesresult # replace the income column with the cleaned bracket-midpoint values
ourfocus.head(10)
Out[16]:
We should remove all the missing values here, since we are measuring the correlation between charges and income.
In [17]:
import numpy as np
ourfocus = ourfocus.dropna().reset_index()
new = pd.DataFrame({'charges':ourfocus['charges'],
'income':ourfocus['income']})
# converting all values of the data frame into floats
new=new.applymap(lambda x:float(x))
#print ourfocus['charges'].mean
#print ourfocus['income'].mean
print new.head(10)
new.corr()
Out[17]:
A correlation of 0.1237 means only a very slight correlation exists between income and charges. So we now know that we cannot use charges to fill in the missing values of income.
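For completeness, here is a minimal sketch of the regression-based imputation idea mentioned earlier. In this case it is purely hypothetical, since the weak correlation makes charges a poor predictor of income; the frame df below is built only for the illustration.
# sketch: impute missing income from charges with a linear fit (hypothetical here)
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame({'charges': data1['charges'], 'income': seriesresult}).astype(float)
known = df.dropna()                                        # rows where both values are present
to_fill = df['income'].isnull() & df['charges'].notnull()  # rows we could fill
reg = linear_model.LinearRegression()
reg.fit(known[['charges']], known['income'])
df.loc[to_fill, 'income'] = reg.predict(df.loc[to_fill, ['charges']])
print "filled %d missing income values" % to_fill.sum()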
I am using Bokeh charts for the visualizations. You can find more about it here.
In [18]:
#Scatter Plot
from collections import OrderedDict
from bokeh.charts import Scatter
data2 = data1.head(200).copy() # copying the first 200 rows into a separate data frame
data2['d.time'] = data2['d.time'].map(lambda x: x/365.0) # converting days into years by dividing by 365
male = data2[(data2.sex == "male")][["age", "d.time"]]
female = data2[(data2.sex == "female")][["age", "d.time"]]
xyvalues = OrderedDict([("male", male.values), ("female", female.values)]) # using OrderedDict
scatter = Scatter(xyvalues, filename = "plots/scatter.html")
scatter.title("Scatter Plot").xlabel("Age in years").ylabel("Years spent on hospitals").legend("top_left").width(600).height(400).show()
from IPython.display import HTML
HTML('<iframe src=plots/scatter.html width=700 height=500></iframe>')
Out[18]:
In [19]:
# Bar Graph
import pandas as pd
# for each race, let's compare how many people died in hospital versus outside the hospital, in a bar chart
data2 = pd.DataFrame({'race': data1['race'],'normaldeath': data1['death'] ,'hospdead': data1['hospdead']})
dead = data2[data2['normaldeath']==1].groupby('race').count()
hospdead = data2[data2['hospdead']==1].groupby('race').count()
dead['normaldeath'] = dead['normaldeath'] - hospdead['hospdead']
dead['hospdead'] = hospdead['hospdead']
print dead
from bokeh.charts import Bar
bar = Bar(dead, filename="plots/bar1.html")
bar.title("Stacked Bar Graph").xlabel("Race").ylabel("Total number of people dead") .legend("top_left").width(600).height(700).stacked().show()
from IPython.display import HTML
HTML('<iframe src=plots/bar1.html width=700 height=800></iframe>')
Out[19]:
In [20]:
# kmeans Clustering
import matplotlib.pyplot as plt
%matplotlib inline
plt.jet() # set the color map. When your colors are lost, re-run this.
import sklearn.datasets as datasets
X, Y = datasets.make_blobs(centers=6, cluster_std=0.5, random_state=0) # random blobs with 6 centers and a standard deviation of 0.5
In [21]:
plt.scatter(X[:,0], X[:,1]);
plt.show()
In [22]:
from sklearn.cluster import KMeans
kmeans = KMeans(3, random_state=8)
Y_hat = kmeans.fit(X).labels_
In [23]:
plt.scatter(X[:,0], X[:,1], c=Y_hat);
plt.show()
In [24]:
plt.scatter(X[:,0], X[:,1], c=Y_hat, alpha=0.4)
mu = kmeans.cluster_centers_
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat))
plt.show()
print mu
In [25]:
data3 = data1.head(200)
#print data3
plt.scatter(data3['age'], data3['d.time']);
plt.show()
In [26]:
# PCA demonstration on the iris data set
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pca = PCA(n_components=2, whiten=True).fit(iris.data)
X_pca = pca.transform(iris.data)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
var_explained = pca.explained_variance_ratio_ * 100
plt.xlabel('First Component: {0:.1f}%'.format(var_explained[0]))
plt.ylabel('Second Component: {0:.1f}%'.format(var_explained[1]))
Out[26]:
Applying PCA is not automatically a good thing: because it throws away variance, you can easily lose accuracy rather than gain it. A quick sanity check is to look at how much variance the retained components actually explain.
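Here is a minimal sketch of that check on the same iris data; the 95% threshold is just an illustrative choice, not a rule.
# inspect the cumulative explained variance of the principal components
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
pca_full = PCA().fit(iris.data)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, frac in enumerate(cumulative, 1):
    print "%d component(s) explain %.1f%% of the variance" % (k, frac * 100)
n_needed = np.argmax(cumulative >= 0.95) + 1  # smallest k reaching the (illustrative) 95% threshold
print "components needed for 95%%: %d" % n_needed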
In [27]:
# Linear Regression
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
houses = datasets.load_boston()
houses_X_temp = houses.data[:, np.newaxis, 2] # use a single feature as the predictor
X_train, X_test, Y_train, Y_test = train_test_split(houses_X_temp, houses.target, test_size=0.45)
lreg = linear_model.LinearRegression()
lreg.fit(X_train, Y_train)
plt.scatter(X_test, Y_test, color='black')
plt.plot(X_test, lreg.predict(X_test), color='red', linewidth=3)
plt.show()
In [28]:
# Plotting the decision boundary of a logistic regression classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
# Let's write a plotting helper for convenience so that we can reuse it.
def plot_estimator(estimator, X, Y):
    estimator.fit(X, Y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of a logistic regression classifier and fit the data.
logreg.fit(X, Y)
plot_estimator(logreg,X,Y)
In [29]:
from sklearn.datasets.samples_generator import make_blobs
X, Y = make_blobs(n_samples=200, centers=2,
random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20);
In [30]:
from sklearn.svm import SVC # Support Vector Classifier
clf = SVC(kernel='linear')
clf.fit(X, Y)
Out[30]:
In [31]:
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
Y, X = np.meshgrid(y, x)
P = np.zeros_like(X)
for i, xi in enumerate(x):
    for j, yj in enumerate(y):
        P[i, j] = clf.decision_function([[xi, yj]])
plt.contour(X, Y, P, colors='k',levels=[-1, 0, 1],linestyles=['--', '-', '--'])
Out[31]:
In [32]:
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
#X, Y = make_blobs(n_samples=500, centers=3,random_state=0, cluster_std=0.60)
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target
plt.scatter(X[:, 0], X[:, 1], c=Y, s=20)
Out[32]:
In [33]:
clf = DecisionTreeClassifier(max_depth=10)
plot_estimator(clf, X, Y) # function call to plot_estimator
Decision trees tend to overfit the data, and many other models face the same problem. A better approach is to use an ensemble of decision trees called a random forest (a quick cross-validated comparison of the two follows the random forest cell below).
In [34]:
# Random forests
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
plot_estimator(clf, X, Y) # function call to plot estimator
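To back up the overfitting point above, here is a minimal sketch comparing cross-validated accuracy of a single deep tree against the random forest on the same two iris features, using the older sklearn.cross_validation module consistent with the rest of this notebook. The forest will typically score at least as well as the single tree.
# compare cross-validated accuracy: single deep tree vs. random forest
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
tree_scores = cross_val_score(DecisionTreeClassifier(max_depth=10), X, Y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=10, random_state=0), X, Y, cv=5)
print "decision tree mean accuracy: %.3f" % tree_scores.mean()
print "random forest mean accuracy: %.3f" % forest_scores.mean()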