Working with data 2017. Class 3

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. Error debugging
  2. Data visualization theory
    • Scatter
    • Histograms, violinplots and two histograms (jointplot)
    • Line plots with distributions (factorplot)
    • Parallel coordinates
  3. Dealing with missing data
  4. In-class exercises to melt, pivot, concat and merge
  5. Groupby and in-class exercises
  6. Stats
    • What's a p-value?
    • One-tailed test vs two-tailed test
    • Count vs expected count (binomial test)
    • Independence between factors: ($\chi^2$ test)

In [77]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import pylab as plt
import pandas as pd
import numpy as np

def read_our_csv():
    #reading the raw data from oecd
    df = pd.read_csv("../class2/data/CITIES_19122016195113034.csv",sep="\t")

    #fixing the columns (the first one is ""METRO_ID"" instead of "METRO_ID")
    cols = list(df.columns)
    cols[0] = "METRO_ID"
    df.columns = cols
    
    #pivot the table
    column_with_values = "Value"
    column_to_split = ["VAR"]
    variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
    df_fixed = df.pivot_table(column_with_values,
                 variables_already_present,
                 column_to_split).reset_index()
    
    return df_fixed
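
The pivot step in read_our_csv can be tried on a small table first. This is only a sketch with made-up numbers: long_df and wide_df are hypothetical names, and the values are not from the OECD file.

```python
import pandas as pd

# Toy long-format table (made-up numbers): one row per (city, year, variable)
long_df = pd.DataFrame({
    "METRO_ID": ["IT001", "IT001", "IT002", "IT002"],
    "Metropolitan areas": ["Rome", "Rome", "Milan", "Milan"],
    "Year": [2000, 2000, 2000, 2000],
    "VAR": ["GDP_PC", "UNEMP_R", "GDP_PC", "UNEMP_R"],
    "Value": [100.0, 10.0, 200.0, 5.0],
})

# Same pattern as in read_our_csv: values, index columns, column to split
wide_df = long_df.pivot_table(
    values="Value",
    index=["METRO_ID", "Metropolitan areas", "Year"],
    columns="VAR",
).reset_index()

# Each distinct value of VAR is now its own column
print(wide_df)
```

After the pivot there is one row per city and year, which is the shape the plots below expect.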


2 Data visualization: A picture is worth a thousand words

Why do we visualize information?

  • It's easier to read than a table
  • We use it to:
    • Communicate information
    • Support our points

2.1 Example: Anscombe's quartet


In [3]:
#From Tufte, "The Visual Display of Quantitative Information"
Image(url="images/tufle1.png")


Out[3]:

In [125]:
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=4, ci=None, palette="muted", size=4, 
           scatter_kws={"s": 50, "alpha": 1})


Out[125]:
<seaborn.axisgrid.FacetGrid at 0x7effa71098d0>
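
The point of the quartet is that the four datasets look very different but have almost the same summary statistics. A minimal sketch to check this, with the values typed in by hand (they should match seaborn's "anscombe" dataset, but that is an assumption; quartet and summary are hypothetical names):

```python
import numpy as np

# Anscombe's quartet, typed in by hand
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    summary[name] = (x.mean(), y.mean(), r, slope, intercept)
    print(name, "mean x = %.2f, mean y = %.2f, r = %.3f, fit: y = %.2f x + %.2f"
          % (x.mean(), y.mean(), r, slope, intercept))
```

The printed lines are nearly identical for the four datasets, which is why the plot is needed to tell them apart.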

2.2 Principles of data visualization for quantitative information

  • You can use different channels

  • Some channels are easily interpreted by our brain

  • Some can be combined better than others

2.2.1 Channels to map information in a figure


In [19]:
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/channels.png",width=1000)


Out[19]:

2.2.2 Relative errors of different channels


In [20]:
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/cleveland.png",width=1000)


Out[20]:

In [16]:
#https://en.wikipedia.org/wiki/Stevens'_power_law
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/steven.png",width=500)


Out[16]:

In [126]:
plt.figure(figsize=(4,3))
plt.bar([1,2],[1,3.5],width=0.3)
#plt.axis('off')
plt.yticks([1,2,3,4])
plt.xticks([])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")


Out[126]:
<matplotlib.text.Text at 0x7effa6ef5160>

In [49]:
plt.scatter?

In [127]:
plt.figure(figsize=(4,3))
plt.scatter([1,1.1],[1,1],s=[500,1250])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")


Out[127]:
<matplotlib.text.Text at 0x7effa6e47198>

In [133]:
plt.figure(figsize=(4,3))
plt.bar([1,2],[2*np.sqrt(0.5),3.5],width=[np.sqrt(0.5),1])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")


Out[133]:
<matplotlib.text.Text at 0x7effa6c52dd8>

In [13]:
plt.figure(figsize=(4,3))
plt.bar([1,2],[2,2],width=[1,1],color=[(20/255,20/255,20/255),(100/255,100/255,100/255)])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much darker?")


Out[13]:
<matplotlib.text.Text at 0x7ff8840e0748>

In [15]:
plt.figure(figsize=(4,3))
plt.bar([1,2],[2,2],width=[1,1],color=[(0/255,74/255,235/255),(59/255,96/255,176/255)])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much more saturation?")


Out[15]:
<matplotlib.text.Text at 0x7ff88408c5f8>
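
The true answers to the "How much bigger?" plots above can be computed from the numbers passed to plt.bar and plt.scatter. This is a small sketch; the exponent 0.7 used for perceived area is an assumed, commonly quoted value for Stevens' power law, not something measured here.

```python
import numpy as np

# Bars of equal width (In [126]): the length carries the value
bar_ratio = 3.5 / 1

# Scatter (In [127]): in matplotlib, s is the marker area in points^2
area_ratio = 1250 / 500
diameter_ratio = np.sqrt(area_ratio)   # the diameters differ by much less

# Bars of different widths (In [133]): area = height * width
small_area = 2 * np.sqrt(0.5) * np.sqrt(0.5)
large_area = 3.5 * 1
rect_area_ratio = large_area / small_area

# Stevens' power law: perceived = k * intensity**a
# a = 0.7 for area is an ASSUMED illustrative exponent
perceived_area_ratio = area_ratio ** 0.7

print("bar length ratio:", bar_ratio)
print("circle area ratio:", area_ratio, "diameter ratio:", diameter_ratio)
print("rectangle area ratio:", rect_area_ratio)
print("perceived circle area ratio (assumed exponent):", perceived_area_ratio)
```

With an exponent below 1, the larger circle is perceived as less than 2.5 times bigger, which is one reason why length is a safer channel than area.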

2.2 Information about ONE quantitative variable

  • Distributions: Histogram, violinplot and box-plot
  • Mean values: Bar plot

2.2.1 DISTRIBUTIONS

  • The distribution is the relationship between each value and its frequency (e.g. the value 7 appears 100 times)
  • We usually plot the relative frequency (the fraction of observations) instead of the raw count
  • When we do this, the area under the curve is equal to 1.

  • When are we interested in this:

    • In the description phase of our research, to see what our data look like
    • To check whether the assumptions of our statistics hold (histograms)
    • Often we want to plot the distribution (or a summary of it) of a quantitative variable for each level of a qualitative variable (for instance, the distribution of GDP in many countries)

In [18]:
#Data example: If you roll two dice, you will get a lot of 7s, many 6s and 8s, some 5s and 9s, a few 4s and 10s, very few 3s and 11s, and almost no 2s and 12s.
#This data is discrete

from collections import Counter

#Roll two dice 10000 times
dice_rolls = np.random.randint(1,7,10000) + np.random.randint(1,7,10000)
#Count the number of each element to create the distribution
Counter(dice_rolls)


Out[18]:
Counter({2: 279,
         3: 583,
         4: 838,
         5: 1136,
         6: 1318,
         7: 1648,
         8: 1388,
         9: 1118,
         10: 880,
         11: 538,
         12: 274})
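
The counts can be turned into relative frequencies, and we can check that the area of the normalized histogram is 1. A minimal sketch, assuming a seeded random generator so that it is reproducible (rolls, fractions, exact are hypothetical names; the seed is arbitrary):

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
rolls = rng.integers(1, 7, 10000) + rng.integers(1, 7, 10000)

# Relative frequency = count / total number of rolls
counts = Counter(rolls.tolist())
fractions = {value: n / len(rolls) for value, n in sorted(counts.items())}

# Exact probabilities for the sum of two fair dice: (6 - |s - 7|) / 36
exact = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}

# density=True rescales the bars so that sum(height * bin width) equals 1
heights, edges = np.histogram(rolls, bins=np.arange(2, 14), density=True)
area = np.sum(heights * np.diff(edges))

for s in range(2, 13):
    print(s, "observed %.3f" % fractions[s], "exact %.3f" % exact[s])
print("area under the histogram:", area)
```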

2.2.1.1 HISTOGRAM

  • A representation of the distribution
  • It's not a good idea to show many of these together (for many groups, a summary such as a box plot is easier to read)

In [30]:
from scipy.stats import norm,lognorm,expon

#seaborn defaults
sns.set()
#And we can visualize it with a histogram
plt.figure(figsize=(4,3))

#Histogram
sns.distplot(dice_rolls, fit=norm, kde=False,rug=False,bins=range(2,14),norm_hist=True)


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff88281a7b8>

2.2.1.2 BOX-PLOT

  • A summary of the distribution.
  • They give a quick view of whether two distributions differ (the box shows the median and the quartiles, not the mean)

In [116]:
#And we can visualize it with a box plot
plt.figure(figsize=(4,2))
sns.boxplot(x=dice_rolls,orient="h")


Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x7effa7373160>
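
The numbers that a box plot draws can be computed by hand. This sketch uses the 36 equally likely outcomes of two dice instead of random rolls, and assumes the usual 1.5 * IQR rule for the whiskers (sums and the other names are hypothetical):

```python
import numpy as np

# All 36 equally likely outcomes of two dice (exact distribution, no sampling)
sums = np.array([a + b for a in range(1, 7) for b in range(1, 7)])

# The box: first quartile, median and third quartile
q1, median, q3 = np.percentile(sums, [25, 50, 75])
iqr = q3 - q1

# Usual whisker rule: points beyond 1.5 * IQR from the box are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = sums[(sums < lower_limit) | (sums > upper_limit)]

print("q1:", q1, "median:", median, "q3:", q3)
print("whisker limits:", lower_limit, upper_limit)
print("number of outliers:", len(outliers))
```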

2.2.1.3 VIOLIN-PLOT

  • A summary of the distribution
  • Only makes sense for CONTINUOUS data

In [119]:
#And we can visualize it with a violin plot
plt.figure(figsize=(4,2))
sns.violinplot(x=dice_rolls,orient="h",inner="quartiles")


Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7effac085f28>
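
A violin plot draws a smoothed (kernel density) version of the distribution, and that is why it is misleading for discrete data. A hand-rolled sketch with Gaussian kernels: gaussian_kde_at is a hypothetical helper and the bandwidth of 0.5 is an arbitrary assumption, not the value seaborn would choose.

```python
import numpy as np

# All 36 equally likely outcomes of two dice
sums = np.array([a + b for a in range(1, 7) for b in range(1, 7)], dtype=float)

def gaussian_kde_at(points, data, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    z = (np.asarray(points, dtype=float)[:, None] - data[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(data) * bandwidth * np.sqrt(2 * np.pi))

# 6.5 can never be rolled, and 1 and 13 are outside the possible range
density = gaussian_kde_at([6.5, 1.0, 13.0], sums, bandwidth=0.5)
print("estimated density at 6.5, 1 and 13:", density)
```

The smoothed curve gives positive density to values that cannot occur, so for dice a histogram is the more honest plot.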

2.2 BAR-PLOT

  • When are we interested in this:

    • We are only interested in a single value, either because
      • We only have that value (e.g. the number of car accidents)
      • We are not so interested in the distribution and we don't want to clutter the plot, e.g. the mean number of car accidents. We need error bars in this case!
  • This plot only makes sense if we have many categories


In [141]:
plt.figure(figsize=(4,2))
#mean is the default
sns.barplot(x=dice_rolls,estimator=np.mean)


Out[141]:
<matplotlib.axes._subplots.AxesSubplot at 0x7effa6aea470>
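
The error bar on a bar plot of a mean is a confidence interval. As far as I know seaborn obtains it by bootstrapping by default; the sketch below does a bootstrap by hand on simulated rolls (the seed, the 1000 rolls and the 1000 resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
rolls = rng.integers(1, 7, 1000) + rng.integers(1, 7, 1000)

# Bootstrap: resample the data with replacement and recompute the mean each time
boot_means = np.array([
    rng.choice(rolls, size=len(rolls), replace=True).mean()
    for _ in range(1000)
])

# The 95% confidence interval is the middle 95% of the bootstrap means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print("mean: %.2f" % rolls.mean())
print("95%% CI: %.2f to %.2f" % (ci_low, ci_high))
```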

2.2 Information about ONE quantitative variable + ONE or more qualitative variables

  • Histogram, violinplot and box-plot
  • Bar plot

In [78]:
df = read_our_csv()
df["C"] = df["METRO_ID"].apply(lambda x: x[:2])
df = df.loc[df["C"] == "IT"]
df.head()


Out[78]:
VAR   METRO_ID  Metropolitan areas  Year  CO2_PC  ENTROPY_1000M  EQU_HOU_DISP_INC    GDP_PC  GINI_INC  GREEN_AREA_PC  LABOUR_PRODUCTIVITY  PCT_INTENSITY  POP_DENS  SPRAWL  UNEMP_R   C
1530     IT001                Rome  2000   10.36            NaN               NaN  47836.13       NaN         251.93            122766.44           0.33    651.06     NaN    11.05  IT
1531     IT001                Rome  2001     NaN            NaN               NaN  48928.59       NaN         250.04            124464.71           0.38    655.97     NaN     9.98  IT
1532     IT001                Rome  2002     NaN            NaN               NaN  49323.37       NaN         248.15            122641.81           0.36    660.98     NaN     7.93  IT
1533     IT001                Rome  2003     NaN            NaN               NaN  48662.53       NaN         246.25            120883.80           0.46    666.08     NaN     8.03  IT
1534     IT001                Rome  2004     NaN            NaN               NaN  50172.53       NaN         244.34            122254.22           0.50    671.29     NaN     7.47  IT

2.2.1.1 BOX-PLOT


In [171]:
plt.figure(figsize=(6,4))
sns.boxplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray")
plt.xticks(rotation=45)
plt.show()


2.2.1.2 VIOLIN-PLOT


In [170]:
plt.figure(figsize=(6,4))

sns.violinplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray",inner="quartiles")
plt.xticks(rotation=45)
plt.show()


2.2.1.3 BAR-PLOT


In [169]:
plt.figure(figsize=(6,4))

sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray")
plt.xticks(rotation=45)
plt.show()


We can add an extra variable with hue (the example uses a bar plot, but it works for all of them)


In [79]:
#Keep only the years 2005 and 2010
df_2005_10 = df.loc[df["Year"].isin([2005,2010]),:]
plt.figure(figsize=(6,4))
sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df_2005_10,hue="Year")
plt.xticks(rotation=45)
plt.show()



In [88]:
#Same plot, but with the cities sorted by GDP per capita
df_2005_10 = df.loc[df["Year"].isin([2005,2010]),:]
df_2005_10 = df_2005_10.sort_values(by="GDP_PC")
plt.figure(figsize=(6,4))
sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df_2005_10,hue="Year")
plt.xticks(rotation=45)
plt.show()


2.3 Information about TWO quantitative variables

  • Scatter plot
  • Line plot
  • Heatmap

2.3.1 SCATTER PLOT

This is the most useful plot.

  • Used to visualize relationship between two variables

In [202]:
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, fit_reg=False,size=4,aspect=1.4)
plt.show()


And we can add a trendline

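  • default: fit a straight line
  • order=2: fit a 2nd-order polynomial
  • logx=True: fit a straight line in log(x), i.e. y ~ log(x)
  • robust=True: fit a robust linear regression that down-weights outliers
  • lowess=True: fit a locally weighted (nonparametric) trend line
  • logistic=True: fit a logistic regression (y must be binary, 0 or 1)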
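
To see what some of these options do, the same fits can be sketched with numpy on synthetic data. As far as I know, logx=True in seaborn fits y against log(x); the data below are made up (y = 2 log(x) + 1 without noise), and sse is a hypothetical helper.

```python
import numpy as np

# Synthetic, noise-free data that follows y = 2 * log(x) + 1
x = np.linspace(1, 100, 50)
y = 2.0 * np.log(x) + 1.0

# default: straight line, y ~ x
lin_slope, lin_intercept = np.polyfit(x, y, 1)

# order=2: 2nd-order polynomial, y ~ x + x^2
quad_coeffs = np.polyfit(x, y, 2)

# logx=True: straight line in log(x), y ~ log(x)
log_slope, log_intercept = np.polyfit(np.log(x), y, 1)

def sse(predicted):
    """Sum of squared errors between the data and a fitted curve."""
    return float(np.sum((y - predicted) ** 2))

print("linear fit error:   ", sse(lin_slope * x + lin_intercept))
print("quadratic fit error:", sse(np.polyval(quad_coeffs, x)))
print("log(x) fit error:   ", sse(log_slope * np.log(x) + log_intercept))
```

The fit in log(x) recovers the slope and intercept used to generate the data, while the straight line and the parabola leave a visible error.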

In [217]:
#lmplot creates its own figure, so plt.figure() is not needed (it only adds an empty figure)
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, logx=True)
plt.show()

In [38]:
#lmplot creates its own figure, so plt.figure() is not needed (it only adds an empty figure)
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, lowess=True)
plt.show()

And we can add the marginal distributions


In [220]:
sns.jointplot(x="GDP_PC", y="UNEMP_R", data=df,
              marginal_kws=dict(bins=20, rug=False, kde=True, kde_kws={"cut":0}),size=6,alpha=0.5)


Out[220]:
<seaborn.axisgrid.JointGrid at 0x7effa5d66630>

And we can bin the data


In [99]:
sns.jointplot(x="GDP_PC", y="UNEMP_R", data=df,kind="hex",
              marginal_kws=dict(bins=20, rug=False, kde=True, kde_kws={"cut":0}), gridsize = 15,size=6)

plt.subplots_adjust(top=0.9)
plt.suptitle('THIS IS A TITLE, YOU BET') # can also get the figure from plt.gcf()


Out[99]:
<matplotlib.text.Text at 0x7ff871450b00>
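
Binning replaces the individual points with a count per cell. The hex plot does this on a hexagonal grid; the sketch below does the same idea on a square grid with numpy, using made-up values for the two variables (gdp and unemp are hypothetical, not the OECD data):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
gdp = rng.normal(40000, 8000, 500)   # made-up GDP per capita values
unemp = rng.normal(10, 3, 500)       # made-up unemployment rates

# Count how many points fall in each cell of a 15 x 15 grid
counts, x_edges, y_edges = np.histogram2d(gdp, unemp, bins=15)

print("grid shape:", counts.shape)
print("points counted:", int(counts.sum()))
print("largest cell:", int(counts.max()))
```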

2.3.2 LINE PLOT

This is the most useful plot when the x variable is time.

  • Used to visualize relationship between two variables, one of them time

In [93]:
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",data=df.reset_index(),estimator=np.mean)


Out[93]:
<matplotlib.text.Text at 0x7ff872a0f6d8>
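
The line drawn with estimator=np.mean is the mean over all cities for each year. A minimal pandas sketch of that calculation with made-up numbers (toy and mean_by_year are hypothetical names):

```python
import pandas as pd

# Made-up unemployment rates for two cities and two years
toy = pd.DataFrame({
    "Metropolitan areas": ["Rome", "Rome", "Milan", "Milan"],
    "Year": [2000, 2001, 2000, 2001],
    "UNEMP_R": [11.0, 10.0, 5.0, 4.0],
})

# One value per year: the mean over the cities (the "units")
mean_by_year = toy.groupby("Year")["UNEMP_R"].mean()
print(mean_by_year)
```

Note that sns.tsplot is not available in recent seaborn releases; as far as I know the closest replacement is sns.lineplot(x="Year", y="UNEMP_R", data=df), which also plots the mean per year with a confidence band.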

In [89]:
sns.tsplot?

2.4 Information about TWO quantitative variables + ONE or more qualitative variables

  • Scatter plot
  • Line plot
  • Scatter matrix
  • Heatmap: Correlation matrix

2.4.1 SCATTER PLOT


In [224]:
sns.lmplot(x="GDP_PC",y="UNEMP_R",hue="Metropolitan areas",data=df, fit_reg=False,size=4,aspect=1.4)
plt.show()


2.4.2 LINE PLOT

This is the most useful plot when the x variable is time.

  • Used to visualize relationship between two variables, one of them time

In [73]:
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",condition="Metropolitan areas",data=df.reset_index())
plt.savefig("annoying_legend.pdf")



In [74]:
#How to move the legend out
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",condition="Metropolitan areas",data=df.reset_index())
plt.legend(loc='center left',bbox_to_anchor=(1,0.5))


Out[74]:
<matplotlib.legend.Legend at 0x7ff88010b128>

2.4.3 SCATTER MATRIX

This is useful to see the relationship between many variables


In [100]:
df_subset = df.loc[:,["Metropolitan areas","CO2_PC","GDP_PC","GREEN_AREA_PC","POP_DENS","UNEMP_R"]].dropna()

In [76]:
sns.pairplot(df_subset.dropna(),hue="Metropolitan areas")


Out[76]:
<seaborn.axisgrid.PairGrid at 0x7ff880184668>

2.4.4 HEATMAP

This is useful to see the correlation between many variables


In [102]:
df_subset.head()


Out[102]:
VAR   Metropolitan areas  CO2_PC    GDP_PC  GREEN_AREA_PC  POP_DENS  UNEMP_R
1530                Rome   10.36  47836.13         251.93    651.06    11.05
1535                Rome    9.98  50376.28         242.42    676.60     7.35
1538                Rome    9.11  50748.42         236.61    693.20     7.04
1545               Milan    7.61  53499.91          24.21   1458.80     5.10
1550               Milan    8.01  55725.72          23.59   1496.90     4.19

In [105]:
corr = df_subset.corr()
corr


Out[105]:
VAR               CO2_PC     GDP_PC  GREEN_AREA_PC   POP_DENS    UNEMP_R
VAR
CO2_PC          1.000000   0.294134      -0.124358  -0.255233  -0.414950
GDP_PC          0.294134   1.000000       0.485565  -0.399848  -0.827301
GREEN_AREA_PC  -0.124358   0.485565       1.000000  -0.504489  -0.425216
POP_DENS       -0.255233  -0.399848      -0.504489   1.000000   0.515049
UNEMP_R        -0.414950  -0.827301      -0.425216   0.515049   1.000000

In [106]:
corr**2


Out[106]:
VAR              CO2_PC    GDP_PC  GREEN_AREA_PC  POP_DENS   UNEMP_R
VAR
CO2_PC         1.000000  0.086515       0.015465  0.065144  0.172184
GDP_PC         0.086515  1.000000       0.235774  0.159878  0.684426
GREEN_AREA_PC  0.015465  0.235774       1.000000  0.254509  0.180809
POP_DENS       0.065144  0.159878       0.254509  1.000000  0.265276
UNEMP_R        0.172184  0.684426       0.180809  0.265276  1.000000
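
Why square the correlation? For two variables, the squared Pearson correlation equals the R² of a straight-line fit, i.e. the fraction of the variance of one variable explained by the other. A sketch that checks this on synthetic data (the slope -0.8 and the noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
x = rng.normal(size=200)
y = -0.8 * x + rng.normal(scale=0.5, size=200)   # made-up negative relationship

# Pearson correlation
r = np.corrcoef(x, y)[0, 1]

# R^2 of a least-squares straight line: 1 - (residual variance / total variance)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print("r:", r)
print("r squared:", r ** 2)
print("R^2 of the linear fit:", r_squared)
```

The sign of the relationship is lost when squaring, so both tables are worth looking at.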

In [264]:
# Compute the correlation matrix
corr = df_subset.corr()

# Generate a mask for the upper triangle (hide the upper triangle)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, square=True,linewidths=.5)

plt.show()
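
The mask is easier to understand on a small matrix. A sketch with a made-up 3 x 3 correlation matrix, showing which cells the heatmap would hide (True means hidden):

```python
import numpy as np

# Made-up symmetric correlation matrix
corr = np.array([[ 1.0,  0.3, -0.4],
                 [ 0.3,  1.0, -0.8],
                 [-0.4, -0.8,  1.0]])

# True on the diagonal and above it, False below
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# What remains visible: only the lower triangle
visible = np.where(mask, np.nan, corr)

print(mask)
print(visible)
```

Since the matrix is symmetric, hiding the upper triangle (and the diagonal of ones) removes only redundant information.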


2.5 Network data

Use Gephi: https://gephi.org/

This is what we do at the corpnet group (corpnet.uva.nl)


In [189]:
Image(url="images/newtork.png")


Out[189]:

2.6 When to use log scale

  • To increase visibility (when too many observations have small values)
  • When we are plotting ratios or percentages (because a ratio of 5/1 and a ratio of 1/5 look equally far from 1 in log scale)
  • In a distribution: when we are trying to show that our data follow an exponential (lin-log scale), lognormal (log-lin scale) or power-law (log-log scale) distribution

In [111]:
plt.bar([1,2],[5,0.2])
plt.plot([1,3],[1,1])
plt.yscale("log")
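
The claim about ratios can be checked with the same numbers as in the plot (5 and 0.2, with the reference line at 1). A small sketch:

```python
import numpy as np

# The two bars of the plot and the reference value 1
ratios = np.array([5.0, 1.0, 0.2])

# Distance to 1 in linear scale: 5 looks much further away than 0.2
linear_distance = np.abs(ratios - 1)

# Distance to 1 in log scale: log10(1) = 0, so this is |log10(ratio)|
log_distance = np.abs(np.log10(ratios))

print("linear distance to 1:", linear_distance)
print("log distance to 1:   ", log_distance)
```

In log scale a fivefold increase and a fivefold decrease are at the same distance from 1, which is what we want when comparing ratios.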