In this introductory course to data science we will start by introducing the basics of the discipline. In this first part of the course we will explain the first steps towards working with data and the basic theoretical concepts needed to analyse a dataset.
In [1]:
#!conda install -y numpy pandas matplotlib seaborn statsmodels ipywidgets
In [2]:
%matplotlib inline
import seaborn as sns
import pandas as pd
In [3]:
sns.set(style="ticks")
We will be using a dataset called the Anscombe dataset that can be loaded from the seaborn examples.
In [4]:
df = sns.load_dataset("anscombe")
And df
is a pandas dataframe...
In [5]:
type(df)
Out[5]:
that we can print, plot, ...
This dataset is comprised of three columns. Two of them, called 'x' and 'y', are filled with numerical values while the 'dataset' column is filled with string values. Each of these values contained in the 'dataset' column will be either 'I','II','III','IV'. This means that the dataset column is a categorical value.
We can visualize the first five rows of a DataFrame using the head() function. We also have a column comprised of integer values that we will use as an index.
In [6]:
df.head()
Out[6]:
When dealing with a new dataset we should start by doing some data exploration. This means that we should try to get an intuitive idea of how the data is structured, and how the different columns are related to each other.
As we don't know anything about the dataset we will start by plotting the numerical values contained in each column.
The histogram is a plot that is used to display the values that a single column takes. An histogram consists in several vertical bars displayed in parallel, where the x axis represents the values that a point in the column can take, and its height is proportional to the number of times that the value appears in the column.
In [7]:
import matplotlib.pyplot as plt
with plt.xkcd():
_=df.plot(kind='hist',title='Histograms of x and y',figsize=(12,5),bins=6,fontsize=20,subplots=True)
#plt.xlabel(,fontsize=16)
A histogram allows us to see how many items of a column have the same range of values, but it doesn't allow us to relate the values contained in one column of our dataset to the values contained in a another column for a given row. The histogram doesn't take into account the ordering of the data.
In order to see how the ordering of the data affects its value we can check the following plot:
In [8]:
df[['x','y']].plot( linestyle='--', marker='o',figsize=(15,5))
plt.title("Anscombe's points",fontsize=25)
plt.xlabel('Index',fontsize=16)
_ = plt.ylabel('Value',fontsize=16)
This is called a line plot. In these plots we will represent the values of different columns in the y axis relative to its index value, represented in the x axis. This kind of plot helps us understand how the index value (and thus the order) affects the structure of the data.
This plot also allows us to compare the values of different columns at a given moment (index value).
As we can see, this plot doesn't give us any intuitive idea of how the data is structured. This probably means that the index value does not have a meaning impact on how the data is structured.
Given that the two columns do not seem to be related to the index value, we will try to figure out if they are related to each other. One graphical way to see how the two series of points are related to each other is by means of a scatter plot.
In this plot, instead of plotting the value of a column against an index, we will plot it against the value the other column have at a given index. In this plot the ordering of the data doesn't matter (you cannot see the information contained in the index column in this plot.)
In [9]:
df.plot(kind='scatter',x='x',y='y',fontsize=14,s=75)
plt.title("Anscombe's points",fontsize=20)
plt.xlabel('x',fontsize=16)
plt.grid()
_ = plt.ylabel('y',fontsize=16)
This is a scatter plot. This speciffic plot answers the following question:
When a point of the data set in the column 'x' had a value of x, which was the value y contained in the column 'y'? (or vice versa).
In order to answer the question with the plot we only have to substitue the value x for the number we want, and search for that number in the lower axis of the plot, labeled x. The answer y will be the height of the points that lie in an straight parallel to the grid and starts at x.
Now, at least we can infer some structure in our data. In this plot it almost seems as if the points were more or less distributed across an imaginary straight line.
Since now, we have not paid much attention to the 'dataset' column. This column is comprised of four different values:
In [10]:
df['dataset'].unique()
Out[10]:
It is possible to interpret the values contained in this column as labels attached to each row. This means that a categorical value can be interpreted as a column that allows us to break our dataset in different subsets. Each of the elements of these diferent "mini datasets" will have the same value. This is what is known as *grouping a data set by a categorical value".
We can also see that these labels represent numbers, so we will add a column to our database representing the same numeric value in decimal base. This way we can represent the information contained in the labels without getting errors in the pandas plotting interface.
In [11]:
number_mapping = lambda x: x.count('I') if x.count('V')==0 else 4#apply a function value by value to all
df['num_ix'] = df['dataset'].map(number_mapping)#the values contained a column
df.head(3)#change the number of displayed rows
Out[11]:
In order to represent the category we will assign a different color to each different value. This means defining a color map between the label values and its visual representation. One of the possible ways of doing it is by manually defining a colormap.
In [12]:
from matplotlib.colors import LinearSegmentedColormap
vmax=3
cmap = LinearSegmentedColormap.from_list('mycmap', [(0/vmax ,'blue'),
(1/vmax, 'green'),
(2/vmax, 'red'),
(3/vmax, 'yellow')]
)
Now that we are ready to plot,we will see how the different categorical values are distributed across the dataset with the kind of plot we already know: the scatter plot.
We will make two different scatter plots. First we will plot the value of one numerical-valued column versus the index value, and we will color each point according to its 'dataset' category.
In [13]:
df.reset_index().plot(kind='scatter',
x='index',
y='x',
fontsize=14,
s=150,c='num_ix',
cmap=cmap,figsize=(15,7),alpha=0.5)
plt.title("Anscombe's points",fontsize=20)
plt.grid()
plt.ylabel('x',fontsize=16)
plt.xlabel('index',fontsize=16)
Out[13]:
We can see in this plot how the index actually represents the different classes ordered. This supports our initial hypothesis that the index is not related to the 'x' and 'y' columns. Now lets take a look at how the coloured scatter plot of x vs y looks.
In [14]:
def color_scatter(df):
df.plot(kind='scatter',x='x',y='y',fontsize=14,s=150,c='num_ix',cmap=cmap,figsize=(15,7),alpha=0.5)
plt.title("Anscombe's points",fontsize=20)
plt.xlabel('x',fontsize=16)
plt.grid()
_ = plt.ylabel('y',fontsize=16)
color_scatter(df)
We can clearly see that our Anscombe's dataset actually consists in four different datasets all added up together in the same DataFrame, and each of the category labels are used to separate each different dataset.
In [15]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
col_wrap=2, ci=None, palette="muted", size=4,
scatter_kws={"s": 50, "alpha": 1}, fit_reg=False)
Out[15]:
Now its time to analise each of the datasets conforming our dataframe independently. And we will start by selecting the data set labeled 'I'.
In [16]:
I = df[df.dataset == 'I'].copy()
I
Out[16]:
In [17]:
len(df)
Out[17]:
We can see that the new dataset contains 10 points. Now let's plot the "I" dataset to see how the points are distributed
In [18]:
color_scatter(I)
In [19]:
with plt.xkcd():
_ =I[['x','y']].plot(kind='hist',title='Histograms of x and y',figsize=(12,5),bins=6,fontsize=20,sharey=True,subplots=True)
It is not easy to infer something meaningfull at first sight from this dataset. We can tell that the points seem to be distributed around an imaginary line, but it is something that is not easy to describe with words.
Statistics is a way of describing a dataset, but instead of using plots we are using numbers. That numbers will represent properties of our data set, and they are a way to quantify certain properties of a data set. you can think of it as a language to describe dataset where plots are not available.
We will try to define a quantity that tells us how likely is to find a data point in any point (x,y) from our scatter plot, and we will call this quantity density.
This concept is closely related to the histograms, as the higher the frecuency count the higher the density will be, as these magnitudes are proportional.
That would mean that the density is intuitively the height that an histogram would have at a given point. This makes the density of a 1 dimensional data set really easy to visualize using a histogram, but it is a little more tricky if we try to do this in 2D, as we will se later.
In [20]:
df.describe()
Out[20]:
In [21]:
import numpy as np
from ipywidgets import interact, interactive, fixed
from IPython.core.display import clear_output,display
import ipywidgets as widgets
def draw_example_dist(n=10000):
a = np.random.normal(loc=-15.,scale=3,size=n)
b = np.random.normal(loc=20,scale=3,size=n)
c = np.random.exponential(size=n,scale=3)
return np.hstack([a,b,c])
example = draw_example_dist()
def draw_regions(distribution='normal',
bins=7,
bw=0.18,
normed=False,
mean=False,
std=False,
percents=False,
hist=True):
x = draw_example_dist() if distribution=='example_2' else np.random.standard_normal(10000)
x = draw_regions_data if distribution=='custom' else x
fig = plt.figure(figsize=(14,8))
#ax = fig.add_subplot(1, 1, 1)
ax = sns.kdeplot(x, cut=0,color='b',shade=True,alpha=0.3,bw=bw)
d_bins = np.linspace(x.min(),x.max(),num=bins)
if hist:
n, bins, patches = ax.hist(x,bins=d_bins,normed=normed,rwidth=0.8,alpha=0.7)
else:
n=[1]
maxx = 1 if normed else max(n)
if mean:
ax.axvline(x=x.mean(), ymin=0, ymax = maxx,
linewidth=6, color='r',label='Mean: {:.3f}'.format(x.mean()),alpha=1.)
if std:
m = x.mean()
ax.axvline(x=m+x.std(), ymin=0, ymax = maxx, linewidth=5, color='g',label='Std: {:.3f}'.format(x.std()),alpha=0.8)
ax.axvline(x=m-x.std(), ymin=0, ymax = maxx, linewidth=5, color='g',alpha=0.8)
if percents:
d = pd.Series(x).describe()
ax.axvline(x=d.ix['min'],
ymin=0, ymax = maxx,
linewidth=5,
color=(0.19130826141258903, 0.13147472185630074, 0.09409307479747722),
label='min: {:.2f}'.format(d.ix['min']),alpha=0.8)
ax.axvline(x=d.ix['25%'],
ymin=0, ymax = maxx,
linewidth=5,
color=(0.38717148143023966, 0.26607979423298955, 0.19042646089965626),
label='25%: {:.2f}'.format(d.ix['25%']),alpha=0.8)
ax.axvline(x=d.ix['50%'],
ymin=0, ymax = maxx,
linewidth=5,
color=(0.5830347014478903, 0.4006848666096784, 0.2867598470018353),
label='50%: {:.2f}'.format(d.ix['50%']),alpha=0.8)
ax.axvline(x=d.ix['75%'],
ymin=0, ymax = maxx,
linewidth=5,
color=(0.7743429628604792, 0.5321595884659791, 0.3808529217993126),
label='75%: {:.2f}'.format(d.ix['75%']),alpha=0.8)
ax.axvline(x=d.ix['max'],
ymin=0, ymax = maxx,
linewidth=5,
color=(0.9415558823529412, 0.663581294117647, 0.47400294117647046),
label='max%: {:.2f}'.format(d.ix['max']),alpha=0.8)
# ax.plot((m-0.1, m+0.1), (0,max(n)), 'k-')
plt.grid(linewidth=2)
plt.title("Basic statistics",fontsize=20)
plt.legend(loc='upper left',fontsize=18)
plt.show()
clear_output(True)
display(interact(draw_regions,bins=(1,100,2),bw=(0.01,10.,0.025),
normed=False,distribution=['Normal','exmple_2','custom']))
In [22]:
#from shaolin import KungFu
def draw_regions(x_regions,y_regions):
I = df
x,y = I.x,I.y
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x,y)
plt.xticks(np.linspace(x.min(),x.max(),num=x_regions))
plt.yticks(np.linspace(y.min(),y.max(),num=y_regions))
plt.grid(linewidth=2)
plt.show()
clear_output(True)
display(interact(draw_regions,x_regions=(0,30), y_regions=(0,30)))
In [23]:
def density_plot(linreg_order=1,confidence_intervals=95,bandwidth=1.43,**kwargs):
if False:# 'bandwidth' not in kwargs.keys():
kwargs['bandwith'] = 1
data=jointplot_data
clear_output()
if kwargs['dataset']=='ALL':
subdf = data.copy()
else:
subdf = data[data['dataset']==kwargs['dataset']].copy()
x,y = subdf.x,subdf.y
x_regions = 10
y_regions = 10
x_bins = np.linspace(x.min(),x.max(),num=kwargs['num_xbins'])
y_bins = np.linspace(y.min(),y.max(),num=kwargs['num_ybins'])
g = sns.JointGrid(x="x", y="y", data=subdf)
g.fig.set_figwidth(14)
g.fig.set_figheight(9)
if kwargs['plot_density']:
g = g.plot_joint(sns.kdeplot, shade=True,alpha=0.5,legend=True,
bw=bandwidth,
gridsize=int((len(x_bins)+len(y_bins))/2),
clip=((x.min(),x.max()),(y.min(),y.max())))
if kwargs['scatter']:# and not self.kwargs['regression']:
g = g.plot_joint(plt.scatter)
if kwargs['marginals'] in ['Histogram','Both']:
_ = g.ax_marg_x.hist(x, alpha=.6,
bins=x_bins,normed=True)
_ = g.ax_marg_y.hist(y, alpha=.6,
orientation="horizontal",
bins=y_bins,normed=True)
if kwargs['marginals'] in ['KDE','Both']:
clip = ((x.values.min()-0.1,x.values.max()+0.1),(y.values.min()-0.1,y.values.max()+0.1))
g = g.plot_marginals(sns.kdeplot, **dict(shade=True))#,bw=kwargs['bandwith']))#,
#gridsize=int((len(x_bins)+len(y_bins))/2)))
#clip=clip)
if kwargs['regression']:
g = g.plot_joint(sns.regplot,
truncate=True,
order=linreg_order,
ci=confidence_intervals,
scatter_kws={"s": 80}
)
if kwargs['grid']:
_ = plt.grid(linewidth=2)
_ = plt.xticks(x_bins)
plt.yticks(y_bins)
_ = plt.xlim(x.values.min(),x.values.max())
_ = plt.ylim(y.values.min(),y.values.max())
_ = plt.show()
if kwargs['save']:
_ =plt.savefig("data/density_plot_{}.png".format(0), dpi=100)
In [24]:
kwargs= {'num_xbins':(1,50,2),
'num_ybins':(1,50,2),
'bandwidth':(0.001,10,0.025),
'grid': True,
'plot_density':False,
'scatter':True,
'regression':False,
'linreg_order':(1,5,1),
'confidence_intervals':(1,100,1),
'marginals':['None','Histogram','KDE','Both'],
'dataset':['ALL','I','II','III','IV'],
'save':False,
#'@kernel':['gau','cos','biw','epa','tri','triw']
}
jointplot_data = df
display(interact(density_plot,**kwargs))
In [24]:
import numpy as np
df['new_index'] = df.groupby('dataset').transform(lambda x:np.arange(len(x)))['x'].values
idx = pd.IndexSlice
sep = df.pivot(columns='dataset',index='new_index')
sep.loc[:,idx[:,'I']]
Out[24]:
In [25]:
sep
Out[25]:
In [26]:
groups = ['I', 'II', 'III', 'IV']
for group in groups:
print(group)
print(df[df.dataset == group].describe())
print()
Let's compare the correlation coefficient for each dataset
In [27]:
for g in groups:
print(df[df.dataset == g]['x'].corr(df[df.dataset == g]['y']))
In [ ]:
Plot datasets
In [35]:
import matplotlib.cm as cmx
import matplotlib.pyplot as plt
groups = df.groupby('dataset')
# Plot
fig, ax = plt.subplots()
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=name)
ax.legend(loc=2)
plt.title("Anscombe's points",fontsize=20)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
In [29]:
g = sns.jointplot(x="x", y="y",data=df[df['dataset']=='I'], kind="kde", size=7, space=0)
In [30]:
_=sep.plot(kind='hist',subplots=True,figsize=(18,10),layout=(3,4),sharex=False,rwidth=0.8)
Show the results of a linear regression within each dataset
In [31]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
col_wrap=2, ci=None, palette="muted", size=4)
Out[31]:
It's the same line for all datasets
Let's plot with its 95% confidence interval region.
In [51]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
col_wrap=2, ci=95, palette="muted", size=4)
Out[51]:
Visualize your data beforehand
One can fit a polynomial regression model to explore simple kinds of nonlinear trends in the dataset
In [52]:
sns.lmplot(x="x", y="y", data=df[df.dataset == 'II'],
order=2, ci=95, scatter_kws={"s": 80});
In the presence of outliers, it can be useful to fit a robust regression, which uses a different loss function to downweight relatively large residuals:
In [53]:
sns.lmplot(x="x", y="y", data=df[df.dataset == 'III'],
robust=True, ci=None, scatter_kws={"s": 80});
In [ ]:
In [ ]: