for Data Science

Twitter: @manjush3

A tutorial on Python for data science.

Contents

  • Background
  • Why use Python?
  • About this tutorial
  • Acquiring Data
  • Observing acquired data
    • Some conclusions
    • Joining data
  • Data Pre-Processing
    • Dealing with redundancies
    • Dealing with missing values
  • Data Visualization
  • Machine Learning Algorithms
    • Unsupervised
    • Supervised
  • Links to learn more

Background

Python is a high-level programming language that lets you work quickly and integrate systems more effectively.

Designed by Guido van Rossum and first released in 1991.

Currently has 3 million+ contributors to the language

Stable release at the time of writing: v3.4.1 (2014-08-01).

Why Use Python?

  • Python is powerful... and fast;
  • plays well with others;
  • runs everywhere;
  • is friendly & easy to learn;
  • is Open.

About this tutorial

Data science is the science of pulling useful insights from data, and it has gained a lot of popularity in recent years. In this tutorial, we will explore some of Python's tools and use them to solve a data problem.

Acquiring data

For this tutorial we will use publicly available data sets. The San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data is only available on a website that has no API, we can use web scraping techniques. Since there are a lot of tutorials on how to get data, I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.

Observing acquired data

In general, there are two kinds of data science problems. The first kind can only be solved with domain knowledge about the data sets; the second kind can be solved by any data scientist without prior domain knowledge. Let's look at the first few rows of each data set to see what kind of data we are dealing with.


In [1]:
import pandas as pd

SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")

SFbusiness_business.head()


Out[1]:
business_id name address city state postal_code latitude longitude phone_number
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN
1 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 14155531470
3 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco CA 94109 37.786848 -122.421547 NaN
4 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco CA 94104 37.792888 -122.403135 NaN

In [2]:
SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")

SFbusiness_inspections.head()


Out[2]:
business_id Score date type
0 10 98 20121114 routine
1 10 98 20120403 routine
2 10 100 20110928 routine
3 10 96 20110428 routine
4 10 100 20101210 routine

In [3]:
SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")

SFbusiness_ScoreLegend.head()


Out[3]:
Minimum_Score Maximum_Score Description
0 0 70 Poor
1 71 85 Needs Improvement
2 86 90 Adequate
3 91 100 Good
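
The legend above can be turned into a lookup that labels each inspection score. The following is a minimal sketch, not part of the original notebook: the legend rows are copied from the output above, while the `scores` series is a hypothetical sample chosen for illustration. It relies on `pd.cut`, which uses right-closed intervals by default.

```python
import pandas as pd

# Legend rows copied from the ScoreLegend output above.
legend = pd.DataFrame({
    "Minimum_Score": [0, 71, 86, 91],
    "Maximum_Score": [70, 85, 90, 100],
    "Description": ["Poor", "Needs Improvement", "Adequate", "Good"],
})

# Hypothetical scores, for illustration only.
scores = pd.Series([98, 100, 96, 72, 65, 88], name="Score")

# Bin edges: one below the lowest minimum, then every maximum.
# With right-closed intervals this gives (-1, 70], (70, 85], (85, 90], (90, 100].
edges = [int(legend["Minimum_Score"].min()) - 1] + legend["Maximum_Score"].tolist()
labels = pd.cut(scores, bins=edges, labels=legend["Description"].tolist())

print(pd.DataFrame({"Score": scores, "Description": labels}))
```

Building the edges from the legend, rather than hard-coding them, keeps the labels in step with the CSV if the thresholds ever change.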

In [4]:
SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")

SFbusiness_violations.head()


Out[4]:
business_id date description
0 10 20121114 Unclean or degraded floors walls or ceilings ...
1 10 20120403 Unclean or degraded floors walls or ceilings ...
2 10 20110428 Inadequate and inaccessible handwashing facili...
3 12 20120420 Food safety certificate or food handler card n...
4 17 20120823 Inadequately cleaned or sanitized food contact...

In [5]:
SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")

SFfood_businesses_plus.head()


Out[5]:
business_id name address city state postal_code latitude longitude phone_no TaxCode business_certificate application_date owner_name owner_address owner_city owner_state owner_zip
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN H24 NaN NaN Tiramisu LLC 33 Belden St San Francisco CA 94104
1 12 KIKKA 250 EMBARCADERO 7/F San Francisco CA 94105 37.788613 -122.393894 NaN H24 NaN 7/12/2002 0:00:00 KIKKA ITO, INC. 431 South Isis Ave. Inglewood CA 90301
2 17 GEORGE'S COFFEE SHOP 2200 OAKDALE AVE San Francisco CA 94124 37.741086 -122.401737 (141) 555-5314 H24 NaN 4/5/1975 0:00:00 LIEUW, VICTOR & CHRISTINA C 648 MACARTHUR DRIVE DALY CITY CA 94015
3 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco CA 94109 37.786848 -122.421547 NaN H24 NaN NaN 24 Hour Fitness Inc 1200 Van Ness Ave, 3rd Floor San Francisco CA 94109
4 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco CA 94104 37.792888 -122.403135 NaN H24 NaN NaN OMNI San Francisco Hotel Corp 500 California St, 2nd Floor San Francisco CA 94104

In [6]:
SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")

SFfood_inspections_plus.head()

In [7]:
SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")

SFfood_violations_plus.head()

In [8]:
# describe() is a simple way to see which columns hold numerical data and how many non-null values each one has

SFfood_businesses_plus.describe()


Out[8]:
business_id latitude longitude business_certificate
count 6352.000000 5495.000000 5495.000000 1131.000000
mean 32944.535894 37.525775 -121.622553 449157.537577
std 28884.685537 3.047733 9.877572 159777.164993
min 10.000000 0.000000 -122.510896 4965.000000
25% 4138.500000 37.760272 -122.435457 446211.000000
50% 28534.500000 37.780568 -122.418129 465714.000000
75% 65468.500000 37.789875 -122.405568 471461.000000
max 74591.000000 37.875937 0.000000 4222215.000000
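
In the summary above, the minimum latitude and the maximum longitude are both 0.0, which is far from San Francisco and may indicate placeholder coordinates. One possible way to flag such rows is sketched below. The toy rows and the bounding box are assumptions made for illustration; the box is a rough guess, not an official city boundary.

```python
import pandas as pd

# Hypothetical rows; the (0.0, 0.0) pair mimics the suspicious min/max values above.
businesses = pd.DataFrame({
    "business_id": [1, 2, 3, 4],
    "latitude": [37.791116, 0.0, 37.741086, None],
    "longitude": [-122.403816, 0.0, -122.401737, None],
})

# Assumed bounding box loosely covering San Francisco (illustrative only).
LAT_RANGE = (37.6, 37.9)
LON_RANGE = (-122.6, -122.3)

# between() is False for NaN, so missing coordinates are handled separately.
in_box = (
    businesses["latitude"].between(*LAT_RANGE)
    & businesses["longitude"].between(*LON_RANGE)
)
has_coords = businesses["latitude"].notnull() & businesses["longitude"].notnull()

# Rows that have coordinates, but implausible ones.
suspicious = businesses[has_coords & ~in_box]
print(suspicious)
```

Rows with missing coordinates are kept apart from rows with implausible ones, because the two cases usually call for different handling.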

In [9]:
SFfood_businesses_plus.count()  # count() ignores NaN values


Out[9]:
business_id             6352
name                    6352
address                 6350
city                    6352
state                   6352
postal_code             6121
latitude                5495
longitude               5495
phone_no                1461
TaxCode                 6352
business_certificate    1131
application_date        4481
owner_name              6342
owner_address           6331
owner_city              6263
owner_state             6262
owner_zip               6244
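
Since count() only reports non-null entries, it can help to look at the missing values directly. The sketch below uses a small hypothetical frame (not the real data) to show how `isnull()` complements `count()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "business_id": [1, 2, 3, 4],
    "phone_no": [np.nan, np.nan, "(141) 555-5314", np.nan],
    "postal_code": ["94104", "94105", None, "94109"],
})

non_null = toy.count()                 # non-null entries per column
missing = toy.isnull().sum()           # missing entries per column
missing_share = toy.isnull().mean()    # fraction of rows that are missing

summary = pd.DataFrame({
    "non_null": non_null,
    "missing": missing,
    "missing_share": missing_share,
})
print(summary)
```

Columns with a large missing share are likely candidates for dropping or special handling during pre-processing.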

Some conclusions

  • Some of the data sheets are quite similar to others.
  • NaN signifies a null (missing) value.
  • Looking at these data sets, only some features are useful; the others should be filtered out.
  • We need more analysis of which columns each data sheet contains. Then we will try to join the data sheets.
  • The data fields that matter are business_id, name, address, latitude, longitude, Score and date, which are present in the businesses and inspections data sheets. The remaining fields are either repeated or not required for our data problems.
  • Almost every data set contains business_id as a key column, so we can use it to join the data sheets.
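
Before joining on business_id, it may be worth checking how the key behaves in each sheet and trimming the columns that are not needed. The following is a minimal sketch with hypothetical toy frames; the names `businesses`, `inspections`, `keep` and `businesses_small` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the businesses and inspections sheets.
businesses = pd.DataFrame({
    "business_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "address": ["1 FIRST ST", "2 SECOND ST", "3 THIRD ST"],
    "latitude": [37.79, 37.78, 37.74],
    "longitude": [-122.40, -122.39, -122.40],
    "phone_number": [None, None, None],
})
inspections = pd.DataFrame({
    "business_id": [1, 1, 2],
    "Score": [98, 100, 92],
    "date": [20121114, 20120403, 20120420],
})

# The key should be unique per business but may repeat per inspection,
# which makes the join one-to-many.
print(businesses["business_id"].is_unique)
print(inspections["business_id"].is_unique)

# Keep only the fields considered useful.
keep = ["business_id", "name", "address", "latitude", "longitude"]
businesses_small = businesses[keep]
print(businesses_small.columns.tolist())
```

If the key turned out not to be unique in the businesses sheet, a join would duplicate inspection rows, so this check is cheap insurance.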

Joining data


In [10]:
'''merge() performs an inner join by default, so main_table only keeps businesses that have
   at least one matching row in SFbusiness_inspections; pass how='left' to keep every business '''

print(SFbusiness_business.columns)

print(SFbusiness_inspections.columns)

main_table = SFbusiness_business.merge( SFbusiness_inspections, on='business_id' )

print(main_table.columns)


Index([business_id, name, address, city, state, postal_code, latitude, longitude, phone_number], dtype=object)
Index([business_id, Score, date, type], dtype=object)
Index([business_id, name, address, city, state, postal_code, latitude, longitude, phone_number, Score, date, type], dtype=object)
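
For reference, merge() performs an inner join unless the `how` argument says otherwise. The sketch below contrasts an inner join with a left join on hypothetical toy frames, and uses the `indicator` option to show which rows had no match.

```python
import pandas as pd

# Hypothetical frames: business 3 has no inspection.
businesses = pd.DataFrame({"business_id": [1, 2, 3], "name": ["A", "B", "C"]})
inspections = pd.DataFrame({"business_id": [1, 1, 2], "Score": [98, 100, 92]})

# Default join type is "inner": unmatched businesses are dropped.
inner = businesses.merge(inspections, on="business_id")

# A left join keeps every business; indicator=True adds a "_merge" column.
left = businesses.merge(inspections, on="business_id", how="left", indicator=True)

print(len(inner))  # 3
print(len(left))   # 4
print(left[left["_merge"] == "left_only"])
```

In the left join, the unmatched business gets NaN in the Score column, which is one source of the missing values that pre-processing has to deal with.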

In [11]:
# Let's look at the first few rows of main_table

main_table.head()


Out[11]:
business_id name address city state postal_code latitude longitude phone_number Score date type
0 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20121114 routine
1 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 98 20120403 routine
2 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 100 20110928 routine
3 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 96 20110428 routine
4 10 TIRAMISU KITCHEN 033 BELDEN PL San Francisco CA 94104 37.791116 -122.403816 NaN 100 20101210 routine
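
The date column in the output above looks like an integer in YYYYMMDD form. Assuming that is how pandas read it, one possible way to convert it to real dates and pick the latest inspection per business is sketched below. The three rows are taken from the output above; the name `latest` is introduced for illustration.

```python
import pandas as pd

# Three rows for business 10, copied from the output above.
inspections = pd.DataFrame({
    "business_id": [10, 10, 10],
    "Score": [98, 98, 100],
    "date": [20121114, 20120403, 20110928],
})

# Convert YYYYMMDD integers to datetimes (assumes the column was read as int).
inspections["date"] = pd.to_datetime(inspections["date"].astype(str), format="%Y%m%d")

# Most recent inspection per business.
latest = inspections.sort_values("date").groupby("business_id").tail(1)
print(latest)
```

With a proper datetime column, sorting, filtering by year and resampling become straightforward.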

Data Pre-Processing


In [ ]: