Python for Data Science

Twitter: @manjush3

A tutorial on Python for data science.

Background
Why use Python?
About this tutorial
Acquiring Data
Observing acquired data
- Some conclusions
- Joining data
Data Pre-Processing
- Dealing with redundancies
- Dealing with missing values
Data Visualization
Machine Learning Algorithms
- Unsupervised
- Supervised
Links to learn more

Background

Python is a high-level programming language that lets you work quickly and integrate systems more effectively.

Designed by Guido van Rossum and first released in 1991.

Currently has 3 million+ contributors to the language.

Stable release: v3.4.1 (2014-08-01)

Why Use Python?

Python is powerful... and fast;
plays well with others;
runs everywhere;
is friendly & easy to learn;
is Open.

About this tutorial

Data science is the practice of extracting useful insights from data, and it has gained a lot of popularity in recent years. In this tutorial, we will explore some of Python's tools and use them to solve a data problem.

Acquiring data

For this tutorial we will use publicly available data sets. The San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data is only available on a website that has no API support, we can use web scraping techniques. Since there are a lot of tutorials on how to get data, I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.

Observing acquired data

In general, there are two kinds of data science problems. The first kind can only be solved with domain knowledge about the data sets; the second kind can be solved by any data scientist without prior domain knowledge. Let's look at the first few rows of each data set to see what kind of data we are dealing with.



In [1]:

import pandas as pd

SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")

SFbusiness_business.head()

Out[1]:

   business_id  name                                address                        city           state  postal_code   latitude    longitude  phone_number
0           10  TIRAMISU KITCHEN                    033 BELDEN PL                  San Francisco  CA           94104  37.791116  -122.403816           NaN
1           12  KIKKA                               250 EMBARCADERO  7/F           San Francisco  CA           94105  37.788613  -122.393894           NaN
2           17  GEORGE'S COFFEE SHOP                2200 OAKDALE AVE               San Francisco  CA           94124  37.741086  -122.401737   14155531470
3           19  NRGIZE LIFESTYLE CAFE               1200 VAN NESS AVE, 3RD FLOOR   San Francisco  CA           94109  37.786848  -122.421547           NaN
4           24  OMNI S.F. HOTEL - 2ND FLOOR PANTRY  500 CALIFORNIA ST, 2ND  FLOOR  San Francisco  CA           94104  37.792888  -122.403135           NaN



In [2]:

SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")

SFbusiness_inspections.head()

Out[2]:

   business_id  Score      date     type
0           10     98  20121114  routine
1           10     98  20120403  routine
2           10    100  20110928  routine
3           10     96  20110428  routine
4           10    100  20101210  routine


In [3]:

SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")

SFbusiness_ScoreLegend.head()

Out[3]:

   Minimum_Score  Maximum_Score        Description
0              0             70               Poor
1             71             85  Needs Improvement
2             86             90           Adequate
3             91            100               Good


In [4]:

SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")

SFbusiness_violations.head()

Out[4]:

   business_id      date  description
0           10  20121114  Unclean or degraded floors walls or ceilings  ...
1           10  20120403  Unclean or degraded floors walls or ceilings  ...
2           10  20110428  Inadequate and inaccessible handwashing facili...
3           12  20120420  Food safety certificate or food handler card n...
4           17  20120823  Inadequately cleaned or sanitized food contact...


In [5]:

SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")

SFfood_businesses_plus.head()

Out[5]:

   business_id  name                                address                        city           state  postal_code
0           10  TIRAMISU KITCHEN                    033 BELDEN PL                  San Francisco  CA           94104
1           12  KIKKA                               250 EMBARCADERO  7/F           San Francisco  CA           94105
2           17  GEORGE'S COFFEE SHOP                2200 OAKDALE AVE               San Francisco  CA           94124
3           19  NRGIZE LIFESTYLE CAFE               1200 VAN NESS AVE, 3RD FLOOR   San Francisco  CA           94109
4           24  OMNI S.F. HOTEL - 2ND FLOOR PANTRY  500 CALIFORNIA ST, 2ND  FLOOR  San Francisco  CA           94104

    latitude    longitude        phone_no  TaxCode  business_certificate   application_date
0  37.791116  -122.403816             NaN      H24                   NaN                NaN
1  37.788613  -122.393894             NaN      H24                   NaN  7/12/2002 0:00:00
2  37.741086  -122.401737  (141) 555-5314      H24                   NaN   4/5/1975 0:00:00
3  37.786848  -122.421547             NaN      H24                   NaN                NaN
4  37.792888  -122.403135             NaN      H24                   NaN                NaN

   owner_name                     owner_address                 owner_city     owner_state  owner_zip
0  Tiramisu LLC                   33 Belden St                  San Francisco  CA               94104
1  KIKKA ITO, INC.                431 South Isis Ave.           Inglewood      CA               90301
2  LIEUW, VICTOR & CHRISTINA C    648 MACARTHUR DRIVE           DALY CITY      CA               94015
3  24 Hour Fitness Inc            1200 Van Ness Ave, 3rd Floor  San Francisco  CA               94109
4  OMNI San Francisco Hotel Corp  500 California St, 2nd Floor  San Francisco  CA               94104


In [6]:

SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")

SFfood_inspections_plus.head()



In [7]:

SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")

SFfood_violations_plus.head()



In [8]:

# describe() is a simple way to see how many non-null values each numeric column has, along with summary statistics

SFfood_businesses_plus.describe()

Out[8]:

        business_id     latitude    longitude  business_certificate
count   6352.000000  5495.000000  5495.000000           1131.000000
mean   32944.535894    37.525775  -121.622553         449157.537577
std    28884.685537     3.047733     9.877572         159777.164993
min       10.000000     0.000000  -122.510896           4965.000000
25%     4138.500000    37.760272  -122.435457         446211.000000
50%    28534.500000    37.780568  -122.418129         465714.000000
75%    65468.500000    37.789875  -122.405568         471461.000000
max    74591.000000    37.875937     0.000000        4222215.000000


In [9]:

SFfood_businesses_plus.count()  # NaN values are ignored

Out[9]:

business_id             6352
name                    6352
address                 6350
city                    6352
state                   6352
postal_code             6121
latitude                5495
longitude               5495
phone_no                1461
TaxCode                 6352
business_certificate    1131
application_date        4481
owner_name              6342
owner_address           6331
owner_city              6263
owner_state             6262
owner_zip               6244
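
Because count() reports only the non-null entries, the number of missing values in a column is the row total minus that count. As a short sketch, isnull().sum() gives the same figure directly; the frame below re-types three columns of the businesses sample shown earlier and is not the full data set.

```python
import numpy as np
import pandas as pd

# Three columns re-typed from the businesses sample (five rows only).
businesses = pd.DataFrame({
    "business_id": [10, 12, 17, 19, 24],
    "name": [
        "TIRAMISU KITCHEN",
        "KIKKA",
        "GEORGE'S COFFEE SHOP",
        "NRGIZE LIFESTYLE CAFE",
        "OMNI S.F. HOTEL - 2ND FLOOR PANTRY",
    ],
    "phone_number": [np.nan, np.nan, 14155531470, np.nan, np.nan],
})

missing = businesses.isnull().sum()   # missing values per column
share = missing / len(businesses)     # fraction of rows that are missing

print(missing)
print(share)
```

A column where most rows are missing, such as the phone number here, is a candidate for dropping rather than filling in.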

Some conclusions

Some of the data sets are quite similar to one another.
NaN marks a missing value.
Looking at these data sets, only some of the columns are useful; the rest should be filtered out.
We need to check which columns each data set contains before we try to join them.
The fields that matter are business_id, name, address, latitude, longitude, Score and date, which are present in the businesses and inspections data sets. The remaining fields are either repeated or not required for our data problem.
Almost every data set contains business_id, which can serve as the key for joining them.
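
The two ideas above, keeping only the useful columns and relying on business_id as the join key, can be checked with a few lines. This is only a sketch on rows re-typed from the earlier samples; the names keep and businesses_small are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Rows re-typed from the businesses and inspections samples.
businesses = pd.DataFrame({
    "business_id": [10, 12, 17],
    "name": ["TIRAMISU KITCHEN", "KIKKA", "GEORGE'S COFFEE SHOP"],
    "address": ["033 BELDEN PL", "250 EMBARCADERO  7/F", "2200 OAKDALE AVE"],
    "city": ["San Francisco"] * 3,
    "state": ["CA"] * 3,
    "latitude": [37.791116, 37.788613, 37.741086],
    "longitude": [-122.403816, -122.393894, -122.401737],
})
inspections = pd.DataFrame({
    "business_id": [10, 10, 10],
    "Score": [98, 98, 100],
    "date": [20121114, 20120403, 20110928],
})

# Keep only the columns identified as useful.
keep = ["business_id", "name", "address", "latitude", "longitude"]
businesses_small = businesses[keep]

# business_id identifies one row in businesses, but a business can have
# several inspections, so the key repeats on the inspections side.
print(businesses["business_id"].is_unique)
print(inspections["business_id"].is_unique)
```

Knowing on which side the key repeats tells us what to expect from a join: one output row per inspection, with the business columns copied onto each.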

Joining data



In [10]:

# DataFrame.merge performs an inner join by default, so only businesses that
# have at least one matching row in SFbusiness_inspections are kept.
# Pass how='left' to keep every record of SFbusiness_business instead.

print(SFbusiness_business.columns)

print(SFbusiness_inspections.columns)

main_table = SFbusiness_business.merge(SFbusiness_inspections, on='business_id')

print(main_table.columns)

Index([business_id, name, address, city, state, postal_code, latitude, longitude, phone_number], dtype=object)
Index([business_id, Score, date, type], dtype=object)
Index([business_id, name, address, city, state, postal_code, latitude, longitude, phone_number, Score, date, type], dtype=object)
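
The choice of join type decides which businesses survive the merge. When no how argument is given, DataFrame.merge performs an inner join; how='left' keeps every row of the left frame. The sketch below shows the difference on a toy example, where the assumption that only business 10 has inspection rows is made up for illustration.

```python
import pandas as pd

businesses = pd.DataFrame({
    "business_id": [10, 12, 17],
    "name": ["TIRAMISU KITCHEN", "KIKKA", "GEORGE'S COFFEE SHOP"],
})
# Hypothetical situation: only business 10 has been inspected.
inspections = pd.DataFrame({
    "business_id": [10, 10],
    "Score": [98, 100],
})

# Default (how="inner"): businesses without inspections are dropped.
inner = businesses.merge(inspections, on="business_id")

# how="left": every business is kept; missing scores become NaN.
left = businesses.merge(inspections, on="business_id", how="left")

print(len(inner), len(left))
print(left)
```

An inner join is fine when only scored businesses matter; a left join is the safer choice when uninspected businesses should stay visible.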



In [11]:

# let's just look at a few rows of our main_table

main_table.head()

Out[11]:

   business_id  name              address        city           state  postal_code   latitude    longitude  phone_number  Score      date     type
0           10  TIRAMISU KITCHEN  033 BELDEN PL  San Francisco  CA           94104  37.791116  -122.403816           NaN     98  20121114  routine
1           10  TIRAMISU KITCHEN  033 BELDEN PL  San Francisco  CA           94104  37.791116  -122.403816           NaN     98  20120403  routine
2           10  TIRAMISU KITCHEN  033 BELDEN PL  San Francisco  CA           94104  37.791116  -122.403816           NaN    100  20110928  routine
3           10  TIRAMISU KITCHEN  033 BELDEN PL  San Francisco  CA           94104  37.791116  -122.403816           NaN     96  20110428  routine
4           10  TIRAMISU KITCHEN  033 BELDEN PL  San Francisco  CA           94104  37.791116  -122.403816           NaN    100  20101210  routine

Data Pre-Processing


