Working with data 2017. Class 3

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

Error debugging
Data visualization theory
- Scatter
- Histograms, violinplots and two histograms (jointplot)
- Line plots with distributions (factorplot)
- Parallel coordinates
Dealing with missing data
In-class exercises to melt, pivot, concat and merge
Groupby and in-class exercises
Stats
- What's a p-value?
- One-tailed test vs two-tailed test
- Count vs expected count (binomial test)
- Independence between factors: ($\chi^2$ test)



In [77]:

    
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import pylab as plt
import pandas as pd
import numpy as np

def read_our_csv():
    #reading the raw data from oecd
    df = pd.read_csv("../class2/data/CITIES_19122016195113034.csv",sep="\t")

    #fixing the columns (the first one is ""METRO_ID"" instead of "METRO_ID")
    cols = list(df.columns)
    cols[0] = "METRO_ID"
    df.columns = cols
    
    #pivot the table
    column_with_values = "Value"
    column_to_split = ["VAR"]
    variables_already_present = ["METRO_ID","Metropolitan areas","Year"]
    df_fixed = df.pivot_table(column_with_values,
                 variables_already_present,
                 column_to_split).reset_index()
    
    return df_fixed
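
The pivot step in read_our_csv can be hard to picture, so here is a minimal sketch on a tiny table. The rows and numbers are made up for illustration (they are not the OECD file); only the column names follow the function above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year, variable)
long_df = pd.DataFrame({
    "METRO_ID": ["X001", "X001", "X002", "X002"],
    "Metropolitan areas": ["City A", "City A", "City B", "City B"],
    "Year": [2000, 2000, 2000, 2000],
    "VAR": ["GDP_PC", "UNEMP_R", "GDP_PC", "UNEMP_R"],
    "Value": [40000.0, 10.0, 50000.0, 5.0],
})

# Each distinct value of "VAR" becomes its own column,
# and "Value" fills the cells of those new columns
wide_df = long_df.pivot_table(values="Value",
                              index=["METRO_ID", "Metropolitan areas", "Year"],
                              columns="VAR").reset_index()
print(wide_df)
```

After the pivot there is one row per city and year, which is the shape the plotting functions below expect.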

2 Data visualization: A picture is worth a thousand words

Why do we visualize information?

It's easier to read than a table
We use it to:
- Communicate information
- Support our points

2.1 Example: Anscombe's quartet



In [3]:

    
#From Tufte "The Visual Display of Quantitative Information"
Image(url="images/tufle1.png")









    Out[3]:



In [125]:

    
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=4, ci=None, palette="muted", size=4, 
           scatter_kws={"s": 50, "alpha": 1})









    Out[125]:





<seaborn.axisgrid.FacetGrid at 0x7effa71098d0>
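
The point of the quartet is that the four datasets look very different but share almost the same summary statistics. A minimal sketch to check this with numpy, with the data typed in by hand so that it runs without downloading anything:

```python
import numpy as np

# Anscombe's quartet (datasets I-III share the same x values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    summary[name] = (x.mean(), y.mean(), r, slope, intercept)
    print(name, [round(v, 2) for v in summary[name]])
```

The means, the correlation and the fitted line come out nearly the same for all four datasets, which is why the plot is needed to tell them apart.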

2.2 Principles of data visualization for quantitative information

You can use different channels
Some channels are easily interpreted by our brain
Some can be combined better than others

2.2.1 Channels to map information in a figure



In [19]:

    
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/channels.png",width=1000)









    Out[19]:

2.2.2 Relative errors of different channels



In [20]:

    
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/cleveland.png",width=1000)









    Out[20]:



In [16]:

    
#https://en.wikipedia.org/wiki/Stevens'_power_law
#From http://www.cs171.org/2015/assets/slides/05-marks_channels.pdf
Image(url="images/steven.png",width=500)









    Out[16]:



In [126]:

    
plt.figure(figsize=(4,3))
plt.bar([1,2],[1,3.5],width=0.3)
#plt.axis('off')
plt.yticks([1,2,3,4])
plt.xticks([])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")









    Out[126]:





<matplotlib.text.Text at 0x7effa6ef5160>



In [49]:

    
plt.scatter?



In [127]:

    
plt.figure(figsize=(4,3))
plt.scatter([1,1.1],[1,1],s=[500,1250])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")









    Out[127]:





<matplotlib.text.Text at 0x7effa6e47198>
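
Stevens' power law models the perceived magnitude as the physical magnitude raised to an exponent. The sketch below applies it to the two "How much bigger?" figures above. The exponents (about 1.0 for length, about 0.7 for area) are commonly quoted approximate values and are an assumption here, not something measured in this class.

```python
def perceived_ratio(actual_ratio, exponent):
    """Perceived ratio under Stevens' power law (perception ~ magnitude ** exponent)."""
    return actual_ratio ** exponent

# Bar chart above: heights 1 and 3.5, judged by length
bar_ratio = 3.5 / 1.0
# Scatter above: s=[500, 1250]; in plt.scatter, s is the marker AREA in points^2
area_ratio = 1250 / 500

ASSUMED_LENGTH_EXPONENT = 1.0   # assumption: length is judged roughly linearly
ASSUMED_AREA_EXPONENT = 0.7     # assumption: area differences are underestimated

perceived_bar = perceived_ratio(bar_ratio, ASSUMED_LENGTH_EXPONENT)
perceived_area = perceived_ratio(area_ratio, ASSUMED_AREA_EXPONENT)

print("bars:    actual", bar_ratio, "perceived", round(perceived_bar, 2))
print("circles: actual", area_ratio, "perceived", round(perceived_area, 2))
```

Under these assumptions the circles look less different than they really are, which is one reason to prefer position or length over area for quantitative comparisons.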



In [133]:

    
plt.figure(figsize=(4,3))
plt.bar([1,2],[2*np.sqrt(0.5),3.5],width=[np.sqrt(0.5),1])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much bigger?")









    Out[133]:





<matplotlib.text.Text at 0x7effa6c52dd8>



In [13]:

    
plt.figure(figsize=(4,3))
plt.bar([1,2],[2,2],width=[1,1],color=[(20/255,20/255,20/255),(100/255,100/255,100/255)])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much darker?")









    Out[13]:





<matplotlib.text.Text at 0x7ff8840e0748>



In [15]:

    
plt.figure(figsize=(4,3))
plt.bar([1,2],[2,2],width=[1,1],color=[(0/255,74/255,235/255),(59/255,96/255,176/255)])
plt.yticks([1,2,3,4])
plt.xticks([0.5,1,2,3,3.5])
plt.tick_params(axis='both', left='off', top='off', right='off', bottom='off', labelleft='off', labeltop='off', labelright='off', labelbottom='off')
plt.grid("on")
plt.title("How much more saturation?")









    Out[15]:





<matplotlib.text.Text at 0x7ff88408c5f8>

2.2 Information about ONE quantitative variable

Distributions: Histogram, violinplot and box-plot
Mean values: Bar plot

2.2.1 DISTRIBUTIONS

The distribution is the relationship between each value and its frequency (for example, the value 7 appears 100 times)
We usually plot the relative frequency (the fraction of observations) instead of the raw count
When we do this, the area below the curve is equal to 1.
When are we interested in this:
- In the description phase of our research, to see what our data looks like
- To check whether the assumptions of our statistics hold (histograms)
- Many times we want to plot the distribution (or a summary of it) of a quantitative variable in terms of a qualitative variable (for instance, the distribution of GDP in many countries)



In [18]:

    
#Data example: If you roll two dice, you will get a lot of 7s, many 6s and 8s, some 5s and 9s, a few 4s and 10s, very few 3s and 11s, and almost no 2s and 12s.
#This data is discrete

from collections import Counter

#Roll two dice 10000 times
dice_rolls = np.random.randint(1,7,10000) + np.random.randint(1,7,10000)
#Count the number of each element to create the distribution
Counter(dice_rolls)









    Out[18]:





Counter({2: 279,
         3: 583,
         4: 838,
         5: 1136,
         6: 1318,
         7: 1648,
         8: 1388,
         9: 1118,
         10: 880,
         11: 538,
         12: 274})
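
Counts like the ones above can be turned into relative frequencies by dividing by the total. A minimal sketch follows; the fixed seed is only there to make it reproducible, so the exact counts will differ from the output above.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
rolls = rng.randint(1, 7, 10000) + rng.randint(1, 7, 10000)

# Count how many times each sum appears
values, counts = np.unique(rolls, return_counts=True)
# Relative frequency = count / total number of rolls
fractions = counts / counts.sum()

for v, c, f in zip(values, counts, fractions):
    print(v, c, round(f, 3))

# With bins of width 1, the area under the histogram is the sum of the fractions
bin_width = 1.0
area = (fractions * bin_width).sum()
print("area:", area)
```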

2.2.1.1 HISTOGRAM

A representation of the distribution
It's not a good idea to show many of these in the same plot



In [30]:

    
from scipy.stats import norm,lognorm,expon

#seaborn defaults
sns.set()
#And we can visualize it with a histogram
plt.figure(figsize=(4,3))

#Histogram
sns.distplot(dice_rolls, fit=norm, kde=False,rug=False,bins=range(2,14),norm_hist=True)









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x7ff88281a7b8>



2.2.1.2 BOX-PLOT

A summary of the distribution (median, quartiles and outliers).
It helps to see whether the medians and spreads of two distributions differ



In [116]:

    
#And we can summarize it with a box-plot
plt.figure(figsize=(4,2))
sns.boxplot(dice_rolls,orient="h")









    Out[116]:





<matplotlib.axes._subplots.AxesSubplot at 0x7effa7373160>
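
A minimal sketch of the numbers behind a box-plot, computed with numpy on simulated dice rolls. It assumes the usual convention (the default in matplotlib and seaborn) that the whiskers reach the most extreme points within 1.5 times the interquartile range of the box.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
rolls = rng.randint(1, 7, 10000) + rng.randint(1, 7, 10000)

# The box: first quartile, median and third quartile
q1, median, q3 = np.percentile(rolls, [25, 50, 75])
iqr = q3 - q1

# The whiskers: most extreme data points within 1.5 * IQR of the box
low_whisker = rolls[rolls >= q1 - 1.5 * iqr].min()
high_whisker = rolls[rolls <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual point
outliers = rolls[(rolls < low_whisker) | (rolls > high_whisker)]

print("box:", q1, median, q3)
print("whiskers:", low_whisker, high_whisker)
print("number of outliers:", outliers.size)
```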

2.2.1.3 VIOLIN-PLOT

A summary of the distribution
Only makes sense for CONTINUOUS data



In [119]:

    
#And we can summarize it with a violin-plot
plt.figure(figsize=(4,2))
sns.violinplot(dice_rolls,orient="h",inner="quartiles")









    Out[119]:





<matplotlib.axes._subplots.AxesSubplot at 0x7effac085f28>
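
To see why a violin-plot only makes sense for continuous data, here is a hand-written sketch of a Gaussian kernel density estimate, the kind of smoothing a violin-plot draws. The bandwidth rule (Scott's rule of thumb) is an assumption; plotting libraries may use a different default.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
rolls = rng.randint(1, 7, 2000) + rng.randint(1, 7, 2000)

def gaussian_kde_at(points, data, bandwidth):
    """Average of Gaussian bumps centred on each data point, evaluated at `points`."""
    points = np.asarray(points, dtype=float)
    z = (points[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

# Scott's rule of thumb for the bandwidth (an assumption, see the text above)
bandwidth = rolls.std(ddof=1) * len(rolls) ** (-1 / 5)

# 6.5 and 13.5 are impossible outcomes for the sum of two dice
density = gaussian_kde_at([6.5, 7.0, 13.5], rolls, bandwidth)
print(density)
```

The estimate gives a clearly positive density to 6.5, and a small one even to 13.5, although neither value can ever be rolled. With discrete data the smooth shape of the violin suggests values that do not exist.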

2.2.2 BAR-PLOT

When are we interested in this:
- We are only interested in one individual value, either because
  - We only have that value (e.g. the number of car accidents)
  - We are not so interested in the distribution and we don't want to clutter the plot. E.g. the mean number of car accidents. We need error bars in this case!
This plot only makes sense if we have many categories



In [141]:

    
plt.figure(figsize=(4,2))
#mean is the default
sns.barplot(dice_rolls,estimator=np.mean)









    Out[141]:





<matplotlib.axes._subplots.AxesSubplot at 0x7effa6aea470>
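
The error bar on a bar showing a mean is typically a confidence interval. A minimal sketch of one way to get it, a 95% bootstrap interval (resampling the data with replacement), on simulated dice rolls. The number of resamples (1000) is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
rolls = rng.randint(1, 7, 1000) + rng.randint(1, 7, 1000)

# Height of the bar
mean = rolls.mean()

# Bootstrap: resample the data with replacement many times and
# recompute the mean each time
boot_means = np.array([
    rng.choice(rolls, size=len(rolls), replace=True).mean()
    for _ in range(1000)
])

# The error bar spans the central 95% of the bootstrapped means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print("mean:", mean, "95% CI:", ci_low, ci_high)
```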

2.2 Information about ONE quantitative variable + ONE or more qualitative variables

Histogram, violinplot and box-plot
Bar plot



In [78]:

    
df = read_our_csv()
df["C"] = df["METRO_ID"].apply(lambda x: x[:2])
df = df.loc[df["C"] == "IT"]
df.head()









    Out[78]:






  
    
VAR   METRO_ID  Metropolitan areas  Year  CO2_PC  ENTROPY_1000M  EQU_HOU_DISP_INC    GDP_PC  GINI_INC  GREEN_AREA_PC  LABOUR_PRODUCTIVITY  PCT_INTENSITY  POP_DENS  SPRAWL  UNEMP_R   C
1530     IT001                Rome  2000   10.36            NaN               NaN  47836.13       NaN         251.93            122766.44           0.33    651.06     NaN    11.05  IT
1531     IT001                Rome  2001     NaN            NaN               NaN  48928.59       NaN         250.04            124464.71           0.38    655.97     NaN     9.98  IT
1532     IT001                Rome  2002     NaN            NaN               NaN  49323.37       NaN         248.15            122641.81           0.36    660.98     NaN     7.93  IT
1533     IT001                Rome  2003     NaN            NaN               NaN  48662.53       NaN         246.25            120883.80           0.46    666.08     NaN     8.03  IT
1534     IT001                Rome  2004     NaN            NaN               NaN  50172.53       NaN         244.34            122254.22           0.50    671.29     NaN     7.47  IT

2.2.1.1 BOX-PLOT



In [171]:

    
plt.figure(figsize=(6,4))
sns.boxplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray")
plt.xticks(rotation=45)
plt.show()

2.2.1.2 VIOLIN-PLOT



In [170]:

    
plt.figure(figsize=(6,4))

sns.violinplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray",inner="quartiles")
plt.xticks(rotation=45)
plt.show()

2.2.1.3 BAR-PLOT



In [169]:

    
plt.figure(figsize=(6,4))

sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df,color="gray")
plt.xticks(rotation=45)
plt.show()

We can add an extra variable with hue (example for a bar plot, but it works for all of them)



In [79]:

    
#Keep only two years (2005 and 2010) so that hue="Year" has two levels
df_2005_10 = df.loc[df["Year"].isin([2005,2010]),:]
plt.figure(figsize=(6,4))
sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df_2005_10,hue="Year")
plt.xticks(rotation=45)
plt.show()



In [88]:

    
df_2005_10 = df.loc[df["Year"].isin([2005,2010]),:]
#Sort by GDP per capita so that the bars appear in order
df_2005_10 = df_2005_10.sort_values(by="GDP_PC")
plt.figure(figsize=(6,4))
sns.barplot(x="Metropolitan areas",y="GDP_PC",data=df_2005_10,hue="Year")
plt.xticks(rotation=45)
plt.show()
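
What the grouped bar plot shows can also be computed directly: one mean per combination of category and hue level. A minimal sketch with pandas on made-up numbers (hypothetical values, not the OECD data):

```python
import pandas as pd

# Hypothetical toy data with repeated observations per city and year
toy = pd.DataFrame({
    "Metropolitan areas": ["Rome", "Rome", "Milan", "Milan", "Rome", "Milan"],
    "Year": [2005, 2010, 2005, 2010, 2005, 2010],
    "GDP_PC": [50.0, 48.0, 56.0, 54.0, 52.0, 55.0],
})

# One bar per (city, year): the mean of GDP_PC within each group.
# unstack puts the hue variable (Year) in the columns.
bar_heights = (toy.groupby(["Metropolitan areas", "Year"])["GDP_PC"]
                  .mean()
                  .unstack("Year"))
print(bar_heights)
```

In the OECD table there is only one row per city and year, so each bar in the plots above is a single value rather than an average.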

2.3 Information about TWO quantitative variables

Scatter plot
Line plot
Heatmap

2.3.1 SCATTER PLOT

This is the most useful plot.

Used to visualize the relationship between two variables



In [202]:

    
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, fit_reg=False,size=4,aspect=1.4)
plt.show()

And we can add a trendline

default: fit a straight line
order=2: fit a 2nd order polynomial
logx=True: fit a straight line against log(x), i.e. y ~ log(x)
robust=True: fit a straight line that is robust to outliers
lowess=True: fit a smooth local trend line (LOWESS)
logistic=True: fit a logistic regression (y must be binary, 0 or 1)
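
To make the first three options concrete, here is a sketch of the equivalent fits done by hand with np.polyfit on synthetic data. The data is generated from y = 3 + 2*log(x) plus noise, a made-up relationship chosen for the example.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
x = np.linspace(1, 100, 200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(scale=0.1, size=x.size)  # synthetic data

def rss(predicted):
    """Residual sum of squares: smaller means a closer fit."""
    return ((y - predicted) ** 2).sum()

# default: straight line, y = a*x + b
a_lin, b_lin = np.polyfit(x, y, 1)
rss_lin = rss(a_lin * x + b_lin)

# order=2: second order polynomial, y = c2*x^2 + c1*x + c0
c2, c1, c0 = np.polyfit(x, y, 2)
rss_quad = rss(c2 * x ** 2 + c1 * x + c0)

# logx=True: straight line against log(x), y = a*log(x) + b
a_log, b_log = np.polyfit(np.log(x), y, 1)
rss_log = rss(a_log * np.log(x) + b_log)

print("linear   :", rss_lin)
print("quadratic:", rss_quad)
print("log(x)   :", rss_log, "with a =", a_log, "and b =", b_log)
```

Because the data really is linear in log(x), the third fit recovers coefficients close to 2 and 3 and leaves the smallest residuals.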



In [217]:

    
#lmplot creates its own figure, so we set the size with size/aspect instead of plt.figure
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, logx=True,size=4,aspect=1.5)
plt.show()



In [38]:

    
#lmplot creates its own figure, so we set the size with size/aspect instead of plt.figure
sns.lmplot(x="GDP_PC",y="UNEMP_R",data=df, lowess=True,size=4,aspect=1.5)
plt.show()

And we can add the marginal distributions



In [220]:

    
sns.jointplot(x="GDP_PC", y="UNEMP_R", data=df,
              marginal_kws=dict(bins=20, rug=False, kde=True, kde_kws={"cut":0}),size=6,alpha=0.5)









    









    Out[220]:





<seaborn.axisgrid.JointGrid at 0x7effa5d66630>

And we can bin the data



In [99]:

    
sns.jointplot(x="GDP_PC", y="UNEMP_R", data=df,kind="hex",
              marginal_kws=dict(bins=20, rug=False, kde=True, kde_kws={"cut":0}), gridsize = 15,size=6)

plt.subplots_adjust(top=0.9)
plt.suptitle('THIS IS A TITLE, YOU BET') # can also get the figure from plt.gcf()









    









    Out[99]:





<matplotlib.text.Text at 0x7ff871450b00>
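
Binning in two dimensions means counting how many points fall in each cell of a grid. The plot above uses hexagonal cells; the sketch below uses square cells with np.histogram2d on synthetic data, which is the same idea and simpler to compute by hand.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)  # synthetic, correlated with x

# Split the x and y ranges in 15 intervals each and count the points in every cell
counts, x_edges, y_edges = np.histogram2d(x, y, bins=15)

print(counts.shape)                    # one count per cell of the grid
print(counts.sum())                    # every point falls in exactly one cell
print("busiest cell:", counts.max())   # this drives the darkest color in the plot
```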

2.3.2 LINE PLOT

This is the most useful plot when the x variable is time.

Used to visualize the relationship between two variables when one of them is time



In [93]:

    
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",data=df.reset_index(),estimator=np.mean)









    Out[93]:





<matplotlib.text.Text at 0x7ff872a0f6d8>
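
sns.tsplot was deprecated and later removed from seaborn, so it may not be available in a newer installation. The line it draws with estimator=np.mean is the mean across units at each time point, which can be sketched with a pandas groupby. The numbers below are hypothetical toy values, not the OECD data.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per city and year
toy = pd.DataFrame({
    "Metropolitan areas": ["Rome", "Milan", "Naples"] * 3,
    "Year": [2000] * 3 + [2001] * 3 + [2002] * 3,
    "UNEMP_R": [11.0, 5.0, 20.0, 10.0, 4.0, 18.0, 8.0, 4.5, 17.5],
})

# For each year: the mean across cities (the line) and the
# standard error of that mean (one possible uncertainty band)
per_year = toy.groupby("Year")["UNEMP_R"].agg(["mean", "sem"])
print(per_year)
```

The resulting table has one row per year, so it can be drawn as a line with any plotting function.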



In [89]:

    
sns.tsplot?

2.4 Information about TWO quantitative variables + ONE or more qualitative variables

Scatter plot
Scatter matrix
Heatmap: Correlation matrix

2.4.1 SCATTER PLOT



In [224]:

    
sns.lmplot(x="GDP_PC",y="UNEMP_R",hue="Metropolitan areas",data=df, fit_reg=False,size=4,aspect=1.4)
plt.show()

2.4.2 LINE PLOT

This is the most useful plot when the x variable is time.

Used to visualize the relationship between two variables when one of them is time



In [73]:

    
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",condition="Metropolitan areas",data=df.reset_index())
plt.savefig("annoying_legend.pdf")



In [74]:

    
#How to move the legend out
sns.tsplot(time="Year",unit="Metropolitan areas",value="UNEMP_R",condition="Metropolitan areas",data=df.reset_index())
plt.legend(loc='center left',bbox_to_anchor=(1,0.5))









    Out[74]:





<matplotlib.legend.Legend at 0x7ff88010b128>

2.4.3 SCATTER MATRIX

This is useful to see the relationship between many variables



In [100]:

    
df_subset = df.loc[:,["Metropolitan areas","CO2_PC","GDP_PC","GREEN_AREA_PC","POP_DENS","UNEMP_R"]].dropna()



In [76]:

    
sns.pairplot(df_subset.dropna(),hue="Metropolitan areas")









    Out[76]:





<seaborn.axisgrid.PairGrid at 0x7ff880184668>

2.4.4 HEATMAP

This is useful to see the correlation between many variables



In [102]:

    
df_subset.head()









    Out[102]:






  
    
VAR   Metropolitan areas  CO2_PC    GDP_PC  GREEN_AREA_PC  POP_DENS  UNEMP_R
1530                Rome   10.36  47836.13         251.93    651.06    11.05
1535                Rome    9.98  50376.28         242.42    676.60     7.35
1538                Rome    9.11  50748.42         236.61    693.20     7.04
1545               Milan    7.61  53499.91          24.21   1458.80     5.10
1550               Milan    8.01  55725.72          23.59   1496.90     4.19


In [105]:

    
corr = df_subset.corr()
corr









    Out[105]:






  
    
VAR               CO2_PC     GDP_PC  GREEN_AREA_PC   POP_DENS    UNEMP_R
CO2_PC          1.000000   0.294134      -0.124358  -0.255233  -0.414950
GDP_PC          0.294134   1.000000       0.485565  -0.399848  -0.827301
GREEN_AREA_PC  -0.124358   0.485565       1.000000  -0.504489  -0.425216
POP_DENS       -0.255233  -0.399848      -0.504489   1.000000   0.515049
UNEMP_R        -0.414950  -0.827301      -0.425216   0.515049   1.000000



In [106]:

    
corr**2









    Out[106]:






  
    
VAR              CO2_PC    GDP_PC  GREEN_AREA_PC  POP_DENS   UNEMP_R
CO2_PC         1.000000  0.086515       0.015465  0.065144  0.172184
GDP_PC         0.086515  1.000000       0.235774  0.159878  0.684426
GREEN_AREA_PC  0.015465  0.235774       1.000000  0.254509  0.180809
POP_DENS       0.065144  0.159878       0.254509  1.000000  0.265276
UNEMP_R        0.172184  0.684426       0.180809  0.265276  1.000000
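
The squared correlation has a direct reading: for a straight-line fit between two variables, it equals the fraction of the variance of one variable explained by the other. For instance, the 0.684426 between GDP_PC and UNEMP_R means that a linear fit on one accounts for about 68% of the variance of the other. A minimal check of that identity on synthetic data:

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only for reproducibility
x = rng.normal(size=200)
y = -2.0 * x + rng.normal(size=200)  # synthetic, negatively related to x

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Straight-line fit and the fraction of variance it explains
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print("r:", r)
print("r squared:", r ** 2)
print("variance explained by the line:", r_squared)
```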



In [264]:

    
# Compute the correlation matrix
corr = df_subset.corr()

# Generate a mask for the upper triangle (hide the upper triangle)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, square=True,linewidths=.5)

plt.show()
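
A small sketch of what the mask does, on a made-up 3x3 correlation matrix: cells where the mask is True are hidden. Note that np.triu_indices_from includes the diagonal, so the trivial correlations of 1 are hidden too.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix
corr = np.array([[ 1.0,  0.3, -0.4],
                 [ 0.3,  1.0, -0.8],
                 [-0.4, -0.8,  1.0]])

# True in the upper triangle (including the diagonal), False elsewhere
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)

# What remains visible: only the values below the diagonal
visible = np.where(mask, np.nan, corr)
print(visible)
```

Since the matrix is symmetric, the lower triangle carries all the information and nothing is lost by hiding the rest.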

2.5 Network data

Use gephi: https://gephi.org/

This is what we do at the corpnet group (corpnet.uva.nl)



In [189]:

    
Image(url="images/newtork.png")









    Out[189]:

2.6 When to use log scale

To increase visibility (when most values are small and a few are very large)
When we are plotting ratios or percentages (because a ratio of 5/1 and a ratio of 1/5 look equally far from 1 in log scale)
In a distribution: when we are trying to show that our distribution follows an exponential (lin-log scale), lognormal (log-lin scale) or power-law (log-log scale) distribution



In [111]:

    
plt.bar([1,2],[5,0.2])
plt.plot([1,3],[1,1])
plt.yscale("log")
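
Two of the claims above can be checked numerically. The sketch below shows that 5/1 and 1/5 are equally far from 1 after taking logs, and that a power law becomes a straight line in log-log scale. The power law y = 2 * x**(-1.5) is a made-up example.

```python
import numpy as np

# Ratios: 5/1 and 1/5 are the "same" change in opposite directions
ratios = np.array([5.0, 1.0, 0.2])
linear_distance = np.abs(ratios - 1)      # distance to 1 in linear scale
log_distance = np.abs(np.log10(ratios))   # distance to 1 in log scale
print(linear_distance)
print(log_distance)

# Power law: y = c * x**(-alpha) gives log(y) = log(c) - alpha*log(x),
# a straight line when both axes are logarithmic
x = np.logspace(0, 3, 50)
y = 2.0 * x ** -1.5
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
print("slope:", slope, "prefactor:", 10 ** intercept)
```

The slope of the fitted line recovers the exponent, which is why power-law distributions are usually inspected in log-log scale.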
