For this tutorial we will use publicly available data sets. The San Francisco Department of Public Health maintains data sets of restaurant safety scores. Since the data is publicly available, acquiring it is easy. If data lives on a website that does not offer an API, we can fall back on web scraping techniques. There are already plenty of tutorials on how to collect data, so I am skipping that part. For convenience, I added all the requisite data sets to the repository. I found Jay-Oh-eN's repository quite helpful for reference.
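If you do want to pull the raw files yourself, a minimal sketch like the one below is enough; note that the URL here is only a placeholder for wherever the published CSV lives, not the actual SFDPH endpoint.

import os
import requests  # assumes the requests library is installed

# Hypothetical URL: substitute the real location of the published CSV file
DATA_URL = "https://example.org/sf_restaurant_scores/businesses.csv"

os.makedirs("data/SFBusinesses", exist_ok=True)
response = requests.get(DATA_URL, timeout=30)
response.raise_for_status()  # fail loudly if the download did not succeed
with open("data/SFBusinesses/businesses.csv", "wb") as f:
    f.write(response.content)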
In general, there are two kinds of data science problems: those that can only be solved with domain knowledge about the data sets, and those that any data scientist can tackle without prior domain knowledge. Let's look at the first few rows of each data set to get a sense of what kind of data we are dealing with.
In [1]:
import pandas as pd
SFbusiness_business = pd.read_csv("data/SFBusinesses/businesses.csv")
SFbusiness_business.head()
Out[1]:
In [2]:
SFbusiness_inspections = pd.read_csv("data/SFBusinesses/inspections.csv")
SFbusiness_inspections.head()
Out[2]:
In [3]:
SFbusiness_ScoreLegend = pd.read_csv("data/SFBusinesses/ScoreLegend.csv")
SFbusiness_ScoreLegend.head()
Out[3]:
In [4]:
SFbusiness_violations = pd.read_csv("data/SFBusinesses/violations.csv")
SFbusiness_violations.head()
Out[4]:
In [5]:
SFfood_businesses_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/businesses_plus.csv")
SFfood_businesses_plus.head()
Out[5]:
In [6]:
SFfood_inspections_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/inspections_plus.csv")
SFfood_inspections_plus.head()
Out[6]:
In [7]:
SFfood_violations_plus = pd.read_csv("data/SFFoodProgram_Complete_Data/violations_plus.csv")
SFfood_violations_plus.head()
Out[7]:
In [8]:
# A simple way to find out how many rows are present and which columns contain numerical data is to use describe()
SFfood_businesses_plus.describe()
Out[8]:
In [9]:
SFfood_businesses_plus.count() #NaN values are ignored
Out[9]:
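count() reports the number of non-null entries per column. If you want the complementary view, i.e. how much data is missing, a quick sketch like the following works; this is plain pandas and nothing specific to these data sets.

# Number of NaN values in each column (the complement of count())
SFfood_businesses_plus.isnull().sum()

# Fraction of missing values per column
SFfood_businesses_plus.isnull().mean()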
In [10]:
'''By default pandas merge() performs an inner join, so we pass how='left' to get a left outer join.
That way every record of SFbusiness_business is kept in main_table, even when no corresponding
row exists in SFbusiness_inspections.'''
print(SFbusiness_business.columns)
print(SFbusiness_inspections.columns)
main_table = SFbusiness_business.merge(SFbusiness_inspections, on='business_id', how='left')
print(main_table.columns)
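As a quick sanity check that the left join really kept every business, a sketch like this works; it relies only on the business_id column we merged on.

# Every business_id from the businesses table should survive the left join
assert set(main_table['business_id']) == set(SFbusiness_business['business_id'])

# Businesses with no inspection rows show NaN in all inspection columns;
# counting them tells us how many businesses were never inspected
inspection_cols = [c for c in SFbusiness_inspections.columns if c != 'business_id']
print(main_table[inspection_cols].isnull().all(axis=1).sum())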
In [11]:
# Let's look at the first few rows of our main_table
main_table.head()
Out[11]:
In [ ]: