Using your best knowledge, write a Python script, or describe an algorithm in words, that:
loads a file from a website: https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv
from that file, counts the number of 1s and 0s under the 'Gender' column, and the number of each age under the 'Age' column
At a minimum, click on the link, look at the general layout of the data, and write out some notes as to how you'd go about a solution.
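One possible sketch of a solution, using pandas (which we introduce later in this notebook); value_counts is one of several ways to tally a column:

In [ ]:
import pandas as pd

# Load the csv directly from the URL
url = 'https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv'
nyt = pd.read_csv(url)

# Count the 1s and 0s under 'Gender', and each distinct value under 'Age'
print(nyt['Gender'].value_counts())
print(nyt['Age'].value_counts())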
We'll take advantage of the two primary cell types in a Jupyter notebook: code cells and Markdown cells. With these two cell types we'll take notes for the workshop today.
In [ ]:
x = 7
print(x + 5)
# This is a comment! Comments are super helpful!
y = 2
print(x // y)               # integer (floor) division
print(float(x) / float(y))  # float division
help(x)
In [ ]:
some_string1 = 'apples'
some_string2 = 'and'
some_string3 = 'bananas'
print(some_string1, some_string2, some_string3)
print(some_string1 + some_string2 + some_string3)
mutable_list = ["apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry"]
immutable_tuple = ("apple", "apple", "banana", "kiwi", "bear", "strawberry", "strawberry")
print(len(some_string1))
print(len(mutable_list))
print(len(immutable_tuple))
print(some_string1[0:5])
print(mutable_list[0:4])
print(immutable_tuple[5:6])
try:
    some_string1[5] = 'd'
except TypeError as e:
    print(e)

mutable_list[5] = 'mango'

try:
    immutable_tuple[5] = 'not going to work'
except TypeError as e:
    print(e)
a = [3 for i in range(10)]
print(a)
Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:
1. numpy
2. scipy
3. pandas
4. matplotlib
5. statsmodels
6. scikit-learn
7. nltk
Countless others are available as well.
For today, we'll focus primarily on the library that accounts for the vast majority of our work: pandas.
pandas is a library built on top of numpy that lets us work with Excel-like matrices in Python. These special matrices are called DataFrames, the primary object in pandas.
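As a minimal sketch, a DataFrame can be built directly from a dictionary of columns:

In [ ]:
import pandas as pd

# Each key becomes a column name; each list becomes that column's values
df = pd.DataFrame({'fruit': ['apple', 'banana', 'kiwi'],
                   'count': [3, 1, 2]})
print(df)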
Earlier we loaded up the csv file from our computer; pandas can also parse from a URL.
In [ ]:
import pandas as pd

nyt = pd.read_csv('https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv')
Like everything else in Python, the DataFrame is an object. We'll use the built-in function type to identify its class:
In [ ]:
print(type(nyt))
This output translates to the object's full class path. Objects in Python are filled with variables and functions, and we use dot notation to access them:
# object.variable
nyt.dtypes
# object.function()
nyt.describe()
One great advantage of IPython is tab completion: type nyt., press Tab, and it will show us what variables and functions exist.
Practice this in the following code cell:
In [ ]:
# example:
print(nyt.dtypes)
nyt.dtypes?
print(type(nyt.dtypes))
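When tab completion isn't handy, the built-in dir() offers a programmatic (if noisier) view of the same names; a quick sketch:

In [ ]:
# List the first few public attributes and functions on the DataFrame
public_names = [name for name in dir(nyt) if not name.startswith('_')]
print(public_names[:10])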
In [ ]:
# Try wrapping the following with type:
print(nyt.describe)
print(nyt.describe())
print(nyt.shape)
print(nyt.index)
print(nyt.columns)
print(nyt.groupby)
print(nyt.groupby('Age'))
print(nyt.Age)
Practice the following steps with these 4 data sets located at http://teachingdatascience.github.io/Rdatasets/datasets.html.
Copy the csv link and use that with read_csv to import it.
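A sketch of the pattern, reusing the nyt csv we already know (swap in your copied link and column names):

In [ ]:
import pandas as pd

# The same pattern applies to any csv link copied from the datasets page
df = pd.read_csv('https://raw.githubusercontent.com/TeachingDataScience/datasets/master/nyt1.csv')

print(df['Gender'].unique())                   # distinct values in the column
print(df.groupby('Gender')['Gender'].count())  # row count per group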
Explore each column with dataframe['column'].unique() and dataframe.groupby('column')['column'].count(), as in the sketch above.

What interesting data point do we learn when we run the following code?
In [ ]:
print(nyt.groupby(['Signed_In', 'Gender']).Age.describe())
In the Signed_In == 0 group, we notice both Age and Gender are also always 0: the group's mean, min, and max are all 0. This intuitively makes sense when working with marketing data: if a user is not logged in, you likely do not know their age or gender either.
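A quick sketch to check that claim directly:

In [ ]:
# Every Age and Gender value in the Signed_In == 0 group should be 0
not_signed_in = nyt[nyt.Signed_In == 0]
print((not_signed_in[['Age', 'Gender']] == 0).all())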
Data does not always come in the forms we expect, so we generally need to work through a process called "data munging": extracting and cleaning up a data set. Given that, we will explore the basics of data munging and aggregation:
Much of this fits into the Split-Apply-Combine strategy of data analysis, popularized by Hadley Wickham's R package plyr (and later, dplyr). In fact, many of these concepts are shared between R and pandas, the primary difference being the syntax.
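As a minimal sketch of the strategy on our data: split the rows into groups, apply a summary to each group, and combine the results into a single table:

In [ ]:
# Split: one group per gender; Apply: average the impressions; Combine: one row per group
print(nyt.groupby('Gender')['Impressions'].mean())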
While we go through each of these examples, take good notes, comment through the code, and write questions that we can come back to.
In [ ]:
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_signedin_only = nyt[nyt['Signed_In'] == 1]
nyt_signedin_only.groupby('Gender').describe()
Practice filtering for the following:
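As a hint, conditions combine with & and |, with each condition wrapped in parentheses; a sketch:

In [ ]:
# Example filter: signed-in users older than 30
subset = nyt[(nyt.Signed_In == 1) & (nyt.Age > 30)]
print(subset.shape)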
Earlier we selected columns (or a list of columns) using key notation (like with dictionaries). Pandas also accepts dot notation (akin to an object mapper, or JavaScript object notation).
To select rows based on the index, we can use .loc[], which accepts index labels, or .iloc[], which accepts integer positions. (The older .ix[] indexer, which guessed between the two, has been removed from recent versions of pandas.) Either accepts a single value or a range (slice).
In [ ]:
# Finding the first ten rows
print(nyt[0:10])

# This will still return the first ten rows
print(nyt.set_index('Age')[0:10])

# .iloc is always positional, so this also returns the first ten rows
print(nyt.iloc[0:10])

# The first one works fine: .loc selects by label, returning all rows where Age == 40.
# The slice fails, because label slicing requires a sorted (or unique) index,
# and this Age index does not represent the row order.
print(nyt.set_index('Age').loc[40])
try:
    print(nyt.set_index('Age').loc[40:45])
except Exception as e:
    print(e)

# Using dot notation and returning uniques of that column
print(nyt.Age.head())
print(nyt.Age.unique())
There are three common techniques for mutating or transforming the data. One is to derive new columns directly from the values of other columns. Another is to apply a named or lambda function to a single column. A third is to apply a function row by row (axis=1), as shown at the end of the next cell.
You generate a new column by assigning to its name as a key.
In [ ]:
nyt['column_of_ones'] = 1
nyt['saw_ad_many_times'] = nyt['Impressions'].apply(lambda x: 1 if x > 5 else 0)
def saw_ad_func(x):
    if x > 5:
        return 1
    else:
        return 0

nyt['saw_ad_many_times'] = nyt['Impressions'].apply(saw_ad_func)

# Practice on one column.
# A common approach to missing data: set the values to null (np.nan) or to -1.
# import numpy as np
# np.nan

# By default, apply works on a single column; with axis=1 it iterates by row
# instead (though this is slower).
nyt['not_signed_in'] = nyt.apply(lambda row: 0 if row['Signed_In'] else 1, axis=1)
In [ ]:
import numpy as np  # needed for np.inf below
nyt['column_of_ones'] = 1
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt['Click_Thru'] = nyt['Click_Thru'].apply(lambda x: 0 if x == np.inf else x)
nyt_signedin_only = nyt[nyt.Signed_In == 1]
nyt_group = nyt.groupby('Age')
print(nyt_group.mean())
In [ ]:
print(nyt_group['Click_Thru'].mean())
In [ ]:
print(nyt_group.agg({
    'Click_Thru': ['mean', 'max'],
    'Gender': 'mean',
}))
In [ ]:
print(nyt[nyt.Signed_In == 1].pivot_table(
    values='column_of_ones',
    index='Gender',
    columns='Age',
    aggfunc='count'
))
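As a related sketch, pd.crosstab builds the same count table without needing the helper column of ones:

In [ ]:
signed_in = nyt[nyt.Signed_In == 1]
print(pd.crosstab(signed_in['Gender'], signed_in['Age']))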
Practice:
matplotlib is Python's core plotting library. While calling .describe() on DataFrames is useful for getting a rough idea of your data, plots let you visualize what the data really looks like. For today, we'll use matplotlib within the context of pandas.
Consider the following data set and code:
In [ ]:
anscombe = pd.DataFrame({
    'x':  [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'y1': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'y2': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'y3': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    'x4': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    'y4': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})
anscombe.describe()
Looking at the raw values as we create the DataFrame, you can tell the four data sets differ, yet .describe() reports very similar summary statistics. The two primary plot types we use from matplotlib are histograms and scatterplots, which help us understand the shape of data.
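Before we plot, one more check: the similarity goes beyond .describe(), since each x/y pair shares nearly the same correlation, about 0.816 (a quick sketch):

In [ ]:
# Correlation between each x and its paired y
for x_col, y_col in [('x', 'y1'), ('x', 'y2'), ('x', 'y3'), ('x4', 'y4')]:
    print(y_col, anscombe[x_col].corr(anscombe[y_col]))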
In [ ]:
%matplotlib inline
# The magic above allows plots to render inside the notebook.
for y in ['y1', 'y2', 'y3', 'y4']:
    if y != 'y4':
        anscombe.plot(kind='scatter', x='x', y=y)
    else:
        anscombe.plot(kind='scatter', x='x4', y=y)
Use these new tools in order to visualize some of this New York Times ad performance data.
In [ ]:
nyt[nyt.Signed_In == 1].Age.hist()
In [ ]:
nyt[nyt.Signed_In == 1].Clicks.hist()
In [ ]:
nyt.describe()
In [ ]:
nyt['Click_Thru'] = nyt['Clicks'] / nyt['Impressions']
nyt[nyt.Signed_In == 1].plot(kind='scatter', x='Age', y='Click_Thru')
In small groups of 3 or 4, practice much of what we did today with two of the four data sets from the repo above.
Primarily, your goals are to:
Summarize your work by answering the following questions:
We will come back as a class and have each group present their answers to the above to one of the data sets they explored.
I encourage you to pick up and start reading the following books to continue learning about programming and data analytics.
Still uncomfortable with Python?
That's okay! Continue practicing your Python basics with Learn Python the Hard Way. It'll strengthen your chops. Do it multiple times; you'll keep learning!
Feel like Pandas can revolutionize how you work?
Purchase and read Wes McKinney's Python for Data Analysis. It'll get you cranking on everything Wes did with Pandas and why it was designed the way it was.
Want to dig into the why of Data Science?
Check out Provost and Fawcett's Data Science for Business. You'll start understanding more about the business practice of Data Science (vs the academic side), along with some fundamentals of applying algorithms and statistics.