Python is a high-level, general-purpose programming language named after a British comedy troupe, created by a Dutch programmer as a hobby project, and maintained by an international group of friendly but opinionated Python enthusiasts (import this!). Until July 2018, Guido van Rossum was the language's Benevolent Dictator for Life; decisions are now made jointly by the Python Steering Council.
Python is popular for data science because it is powerful and fast, plays well with others, runs everywhere, and is easy to learn, highly readable, and open. Because it's general purpose, it can also be used for full-stack development. It has a growing list of useful libraries for scientific programming, data manipulation, and data analysis (NumPy, SciPy, pandas, scikit-learn, statsmodels, Matplotlib, PyBrain, etc.).
IPython is an enhanced, interactive Python interpreter that Fernando Perez started as a grad school project. IPython (Jupyter) notebooks allow you to run a multi-language (Python, R, Julia, Markdown, LaTeX, etc.) interpreter in your browser to create rich, portable, and shareable code documents.
Pandas is a library created by Wes McKinney that introduces the R-like dataframe object to Python and makes working with data in Python a lot easier. It's also a lot more efficient than the R dataframe and pretty much makes Python superior to R in every imaginable way (except for ggplot2).
To start up a Jupyter notebook server, navigate to the directory where you want the notebooks to be saved and run the command:
jupyter notebook
A browser should open with a notebook navigator. Click the "New" button and select "Python 3".
A beautiful blank notebook should open in a new tab.
Name the notebook by clicking on "Untitled" at the top of the page.
Notebooks are sequences of cells. Cells can be markdown, code, or raw text. Change the first cell to markdown and briefly describe what you are going to do in the notebook.
In [222]:
# Import Statements
import pandas as pd
import numpy as np
%matplotlib inline
In [223]:
crimes = pd.read_csv('chicago_past_year_crimes.csv')
So far we've been working with raw text files. That's one way to store and interact with data, but only a limited set of functions can take raw text as input. Python has an amazing array of data structures that give you a lot of extra power in working with data.
Built-in Data Structures
Additional Essential Data Structures
Today we'll primarily be working with the pandas DataFrame. The pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes. It's basically a spreadsheet you can program and it's an incredibly useful Python object for data analysis.
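To make the description above concrete, here is a minimal sketch of a DataFrame built by hand. The column names and values are made up for illustration and are not taken from the crimes file.

```python
import pandas as pd

# Hypothetical data: three made-up incident records
incidents = pd.DataFrame({
    "case_id": ["A1", "A2", "A3"],     # strings
    "ward": [12, 7, 12],               # integers
    "arrest": [True, False, False],    # booleans
})

# Size-mutable: adding a column changes the shape of the existing object
incidents["fine"] = [50.0, 0.0, 125.5]

# Labeled axes: rows have an index, columns have names
print(incidents.shape)
print(incidents.dtypes)
```

Each column keeps its own dtype, which is what "potentially heterogeneous" refers to.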
You can load data into a dataframe using pandas' excellent read_* functions.
We're going to try two of them: read_table and read_csv.
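As a rough sketch of how the two functions relate, the example below reads the same made-up records once as comma-separated text and once as tab-separated text. It uses in-memory strings (via io.StringIO) instead of real files, and the two rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row data set, not the real crimes file
csv_text = "CASE,PRIMARY DESCRIPTION\nX1,THEFT\nX2,BATTERY\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv splits on commas by default
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default
from_tsv = pd.read_table(io.StringIO(tsv_text))

print(from_csv)
print(from_csv.equals(from_tsv))
```

Both functions accept a sep argument, so either one can read either layout if you tell it which delimiter to expect.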
Pro tip: TAB COMPLETION!
Pro tip: Jupyter will pull up the docstring for a command if you follow it with a question mark (for example, pd.read_csv?).
Pro tip: Jupyter will show you the allowable arguments if you hit Shift + Tab.
In [224]:
crimes.head()
Out[224]:
In [225]:
crimes.tail()
Out[225]:
In [226]:
crimes.shape
Out[226]:
In [227]:
crimes.dtypes
Out[227]:
Pro tip: you'll notice that some commands have looked like pd.something(), some like data.something(), and some like data.something without (). The first is a pandas function or class, the second is a method, and the third is an attribute. Methods are actions you take on a dataframe or series, while attributes are descriptors of the dataframe or series.
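The three call styles can be seen side by side on a tiny frame. This is only an illustrative sketch; the frame and its values are made up.

```python
import pandas as pd

# pd.DataFrame(...) is a class exposed by the pandas module
data = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# pd.concat(...) is a module-level function
stacked = pd.concat([data, data])

# data.head(...) is a method: an action on the frame, so it takes ()
first_two = data.head(2)

# data.shape is an attribute: a stored description, so no ()
dims = data.shape

print(first_two)
print(dims)
```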
In [228]:
crimes.columns
Out[228]:
Notice that some of the column names have spaces at the start or end of the name. Let's remove those so that the names are easier to type and reference.
In [229]:
# remove white spaces
crimes.columns = crimes.columns.str.strip()
crimes.columns
Out[229]:
In [230]:
# replacing spaces with underscore
crimes.columns = crimes.columns.str.replace(' ', '_')
crimes.columns
Out[230]:
In [231]:
# We'll also remove the double "_" in DATE__OF_OCCURRENCE
crimes.columns = crimes.columns.str.replace('__', '_')
crimes.columns
Out[231]:
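The three cleanup steps above can also be chained. The sketch below is one possible variant, not the notebook's own method: it strips the names and then uses a regular expression to turn each run of whitespace into a single underscore, which avoids the double underscore in the first place. The column names are assumed examples resembling the ones in the crimes file.

```python
import pandas as pd

# Hypothetical messy headers (leading/trailing spaces, a double space)
df = pd.DataFrame(columns=[" CASE#", "DATE  OF OCCURRENCE", " PRIMARY DESCRIPTION "])

df.columns = (
    df.columns
    .str.strip()                            # drop leading/trailing whitespace
    .str.replace(r"\s+", "_", regex=True)   # each whitespace run -> one "_"
)

print(list(df.columns))
```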
The LOCATION column seems redundant, seeing that we also have X_COORDINATE and Y_COORDINATE columns. Let's drop it.
In [232]:
crimes.drop('LOCATION', axis=1, inplace=True)
In [233]:
crimes.columns
Out[233]:
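Dropping a column does not have to modify the frame in place. As a small sketch on made-up data, drop() without inplace=True returns a new frame and leaves the original untouched, which makes it easier to re-run a cell without errors.

```python
import pandas as pd

# Hypothetical one-row frame with a redundant LOCATION column
df = pd.DataFrame({
    "X_COORDINATE": [1.0],
    "Y_COORDINATE": [2.0],
    "LOCATION": ["(1.0, 2.0)"],
})

# columns=[...] is equivalent to passing the labels with axis=1
trimmed = df.drop(columns=["LOCATION"])

print(list(trimmed.columns))
print(list(df.columns))  # the original still has LOCATION
```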
In [234]:
crimes.describe()
Out[234]:
In [235]:
crimes.describe(include=['O'])
Out[235]:
In [236]:
crimes.isnull().sum()
Out[236]:
In [237]:
crimes['PRIMARY_DESCRIPTION'].head()
Out[237]:
In [238]:
#using . notation
crimes.PRIMARY_DESCRIPTION.head()
Out[238]:
In [239]:
# get value counts
crimes.PRIMARY_DESCRIPTION.value_counts()
Out[239]:
In [240]:
# selecting two columns
crimes[['PRIMARY_DESCRIPTION', 'SECONDARY_DESCRIPTION']].head()
Out[240]:
In [241]:
#subset by row index
crimes.PRIMARY_DESCRIPTION[3:10]
Out[241]:
In [242]:
#Use the iloc indexer to select rows and columns by position
crimes.iloc[10:20,4:6]
Out[242]:
In [243]:
#Create a boolean series based on a condition
theft_bool = crimes['PRIMARY_DESCRIPTION']=='THEFT'
theft_bool
Out[243]:
In [244]:
#now pass that series to the dataframe to subset it
theft = crimes[theft_bool]
theft.head()
Out[244]:
In [245]:
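#the same subset in one step; .copy() makes theft an independent dataframe,
#so later in-place changes to it do not raise SettingWithCopyWarning
theft = crimes[crimes['PRIMARY_DESCRIPTION']=='THEFT'].copy()
theft.head()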
Out[245]:
In [274]:
crimes[(crimes['PRIMARY_DESCRIPTION']=='CRIMINAL DAMAGE')].head()
Out[274]:
In [275]:
crimes[(crimes['PRIMARY_DESCRIPTION']=='CRIMINAL DAMAGE')&(crimes['SECONDARY_DESCRIPTION']=='TO PROPERTY')].head()
Out[275]:
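Boolean conditions can be combined in a few ways. The sketch below uses four made-up rows to show & (and), | (or), ~ (not), and isin(). Each comparison needs its own parentheses when written inline, because & and | bind more tightly than ==.

```python
import pandas as pd

# Hypothetical rows, not taken from the crimes file
df = pd.DataFrame({
    "PRIMARY_DESCRIPTION": ["THEFT", "CRIMINAL DAMAGE", "CRIMINAL DAMAGE", "BATTERY"],
    "SECONDARY_DESCRIPTION": ["RETAIL THEFT", "TO PROPERTY", "TO VEHICLE", "SIMPLE"],
})

is_damage = df["PRIMARY_DESCRIPTION"] == "CRIMINAL DAMAGE"
to_property = df["SECONDARY_DESCRIPTION"] == "TO PROPERTY"

both = df[is_damage & to_property]     # rows where both conditions hold
either = df[is_damage | to_property]   # rows where at least one holds
not_damage = df[~is_damage]            # rows where the condition is False

# isin() replaces a chain of == comparisons joined with |
theft_or_battery = df[df["PRIMARY_DESCRIPTION"].isin(["THEFT", "BATTERY"])]

print(both)
print(theft_or_battery)
```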
In [246]:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=False)
theft.head()
Out[246]:
In [247]:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=True)
theft.head()
Out[247]:
Hmmm. Something isn't right about how this is sorting. Let's look into it.
In [248]:
theft.dtypes
Out[248]:
Right now the dates are objects. To ensure they're handled correctly, they should be datetime. Let's fix that!
In [250]:
theft.DATE_OF_OCCURRENCE = pd.to_datetime(theft.DATE_OF_OCCURRENCE)
In [251]:
theft.head()
Out[251]:
In [252]:
theft.sort_values('DATE_OF_OCCURRENCE', inplace=True, ascending=True)
theft.head()
Out[252]:
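To see why sorting text dates goes wrong, here is a small sketch with three made-up dates. The month/day/year format is an assumption for illustration; the real file may use a different one. Strings are compared character by character, so "10/01/2017" sorts after "02/15/2018" even though it is the earlier date.

```python
import pandas as pd

# Hypothetical dates stored as text (assumed MM/DD/YYYY)
df = pd.DataFrame({"DATE_OF_OCCURRENCE": ["10/01/2017", "02/15/2018", "01/20/2018"]})

# Sorting the raw strings gives alphabetical order, not chronological order
as_text = df.sort_values("DATE_OF_OCCURRENCE")

# Parsing first gives a true chronological sort
df["parsed"] = pd.to_datetime(df["DATE_OF_OCCURRENCE"], format="%m/%d/%Y")
as_dates = df.sort_values("parsed")

print(as_text)
print(as_dates)
```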
You can see that the row labels for the first 5 rows are NOT 0, 1, 2, 3, and 4. If you want to select the first five rows, use DataFrame.iloc[] to select by position. If you want to select the rows with labels 0 through 4, use DataFrame.loc[].
The easy way to remember which is which is that iloc[] stands for integer location, because you use integer positions and not labels to select the data.
In [253]:
#print the first five rows of theft data
theft.iloc[0:5]
Out[253]:
In [254]:
#print first ten rows of theft data
theft.iloc[0:10]
Out[254]:
In [255]:
#print the row with index label 12
theft.loc[12]
Out[255]:
In [256]:
# print the row at the fifth position
theft.iloc[4]
Out[256]:
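The difference between the two indexers is easiest to see on a frame whose labels are out of order, as they are after a sort. The three-row frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[12, 0, 7])

by_position = df.iloc[0]   # the first row, whatever its label is (here 12)
by_label = df.loc[0]       # the row labeled 0, which is second in the frame

print(by_position["value"])
print(by_label["value"])
```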
In [257]:
scores = pd.read_csv('fandango_score_comparison.csv')
In [258]:
scores.head()
Out[258]:
In [259]:
scores.describe()
Out[259]:
In [260]:
scores.info()
In [261]:
scores.IMDB.mean()
Out[261]:
In [262]:
scores.IMDB.describe()
Out[262]:
In [263]:
max_IMDB = scores.IMDB.max()
max_IMDB
Out[263]:
In [264]:
min_IMDB = scores.IMDB.min()
min_IMDB
Out[264]:
In [266]:
# Return the list of movies with the lowest score:
scores[scores.IMDB == min_IMDB]
Out[266]:
In [265]:
#Return the list of movies with the highest score:
scores[scores.IMDB == max_IMDB]
Out[265]:
In [267]:
# Movies with the highest RottenTomatoes rating
scores[scores.RottenTomatoes == scores.RottenTomatoes.max()]
Out[267]:
In [269]:
# Movies with the lowest RottenTomatoes rating
scores[scores.RottenTomatoes == scores.RottenTomatoes.min()]
Out[269]:
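Filtering on equality with the maximum returns every tied row, which is why the cells above use that pattern. As a sketch on made-up ratings (the name toy_scores and its values are hypothetical), compare it with idxmax(), which returns only the label of the first maximum.

```python
import pandas as pd

# Hypothetical ratings with a tie for the top score
toy_scores = pd.DataFrame({"FILM": ["A", "B", "C"], "IMDB": [8.6, 7.1, 8.6]})

# Boolean filter: keeps all rows that share the maximum
top_all = toy_scores[toy_scores["IMDB"] == toy_scores["IMDB"].max()]

# idxmax(): label of the first row holding the maximum, so ties are lost
top_first = toy_scores.loc[toy_scores["IMDB"].idxmax()]

print(top_all)
print(top_first["FILM"])
```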
Now we can plot the series with ease!
In [150]:
crimes.groupby('PRIMARY_DESCRIPTION').size()
Out[150]:
In [136]:
crimes.groupby(['PRIMARY_DESCRIPTION', 'SECONDARY_DESCRIPTION']).size()
Out[136]:
In [151]:
crimes.head()
Out[151]:
In [156]:
# the column was renamed above, so it now has a single underscore
crimes['DATE_OF_OCCURRENCE'] = pd.to_datetime(crimes['DATE_OF_OCCURRENCE'])
In [165]:
crimes['year'] = crimes['DATE_OF_OCCURRENCE'].map(lambda x: x.year)
In [164]:
crimes['month'] = crimes['DATE_OF_OCCURRENCE'].map(lambda x: x.month)
In [168]:
month_year_crimes = crimes.groupby(['year', 'month']).size()
month_year_crimes
Out[168]:
In [169]:
month_year_crimes.plot()
Out[169]:
In [170]:
month_year_crimes.hist()
Out[170]:
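The year and month columns can also be derived with the .dt accessor, which works on a whole datetime column at once. The sketch below uses four made-up dates and a hypothetical column name, then counts rows per (year, month) the same way as above.

```python
import pandas as pd

# Hypothetical dates, already parsed to datetime
df = pd.DataFrame({
    "when": pd.to_datetime(["2017-11-03", "2017-11-20", "2017-12-01", "2018-01-15"])
})

# .dt exposes datetime parts for the entire column
df["year"] = df["when"].dt.year
df["month"] = df["when"].dt.month

# size() counts the rows in each (year, month) group
counts = df.groupby(["year", "month"]).size()

print(counts)
```

The result is a Series with a two-level index, so a single group can be looked up with a tuple such as counts.loc[(2017, 11)].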